Information Retrieval Datasets
This section lists native IR dataset definitions. For access to hundreds more IR datasets, see IR-Datasets Integration.
MS MARCO Passage
The MS MARCO (Microsoft Machine Reading Comprehension) Passage Ranking dataset. One of the most widely used benchmarks for neural IR research.
Contains ~8.8M passages and ~500K training queries with sparse relevance judgments.
Publication: Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In CoCo@NIPS.
See https://github.com/microsoft/MSMARCO-Passage-Ranking for more details.
-
Dataset com.microsoft.msmarco.passage.collection_etc
datamaestro.data.Folder
Documents and some more files
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.collection
datamaestro_text.data.ir.csv.Documents
MS-Marco documents
This file contains each passage in the larger MSMARCO dataset.
Format is TSV (PID \t Passage)
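The two-column layout above can be read with the standard library alone. The following is a minimal sketch, assuming one passage per line with exactly one tab between the PID and the text; the function name `iter_passages` and the sample lines are made up for illustration and are not part of datamaestro.

```python
import csv
import io
from typing import Iterator, Tuple


def iter_passages(stream) -> Iterator[Tuple[str, str]]:
    """Yield (pid, passage) pairs from a tab-separated stream."""
    # QUOTE_NONE: quote characters inside passages are plain text, not CSV quoting
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # assumption: skip lines that do not have exactly two columns
        pid, passage = row
        yield pid, passage


# Made-up sample standing in for the real collection file
sample = io.StringIO('0\tA made-up passage with "quotes" in it.\n1\tAnother made-up passage.\n')
passages = dict(iter_passages(sample))
```

Streaming the file row by row keeps memory use flat, which matters for a collection of several million passages; the `dict(...)` call here is only reasonable for the tiny sample.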
-
Dataset com.microsoft.msmarco.passage.train_run
datamaestro_text.data.ir.csv.AdhocRunWithText
TSV format: qid, pid, query, passage
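A run file in this four-column layout repeats the query text on every line, so a common first step is to group candidate passages by query. The sketch below assumes tab-separated rows in the order qid, pid, query, passage; `group_run_by_query` and the sample rows are hypothetical and not part of datamaestro.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple


def group_run_by_query(stream) -> Tuple[Dict[str, str], Dict[str, List[Tuple[str, str]]]]:
    """Return (queries, candidates): query text per qid, and (pid, passage) lists per qid."""
    queries: Dict[str, str] = {}
    candidates: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # assumption: skip lines that do not have exactly four columns
        qid, pid, query, passage = row
        queries[qid] = query
        candidates[qid].append((pid, passage))  # file order is preserved
    return queries, dict(candidates)


# Made-up sample rows
sample = io.StringIO(
    "q1\tp1\tmade-up query\tfirst made-up passage\n"
    "q1\tp2\tmade-up query\tsecond made-up passage\n"
    "q2\tp3\tanother query\tthird made-up passage\n"
)
queries, candidates = group_run_by_query(sample)
```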
-
Dataset com.microsoft.msmarco.passage.train_queries
-
Dataset com.microsoft.msmarco.passage.train_qrels
-
Dataset com.microsoft.msmarco.passage.train
datamaestro_text.data.ir.Adhoc
MS-Marco train dataset
Tasks: information retrieval, passage retrieval
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.train_withrun
datamaestro_text.data.ir.RerankAdhoc
MSMarco train dataset, including the top-1000 documents to re-rank
Tasks: information retrieval, passage retrieval
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.train_idtriples
datamaestro_text.data.ir.TrainingTripletsLines
Full training triples (query, positive passage, negative passage) with IDs
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.train_texttriples_small
datamaestro_text.data.ir.TrainingTripletsLines
Small training triples (query, positive passage, negative passage) with text
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.train_texttriple_full
datamaestro_text.data.ir.TrainingTripletsLines
Full training triples (query, positive passage, negative passage) with text
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.dev_queries
-
Dataset com.microsoft.msmarco.passage.dev_run
-
Dataset com.microsoft.msmarco.passage.dev_qrels
-
Dataset com.microsoft.msmarco.passage.dev
datamaestro_text.data.ir.Adhoc
MS-Marco dev dataset
Tasks: information retrieval, passage retrieval
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.dev_withrun
datamaestro_text.data.ir.RerankAdhoc
MSMarco dev dataset, including the top-1000 documents to re-rank
Tasks: information retrieval, passage retrieval
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
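Re-ranking results on the MS MARCO passage dev set are commonly reported as MRR@10. The sketch below shows one way to compute it from plain Python dictionaries; the function name `mrr_at_k` and the toy data are illustrative assumptions, and the score is averaged over the queries present in the rankings, which may differ from what an official evaluation script does.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked_pids in rankings.items():
        relevant_pids = relevant.get(qid, set())
        for rank, pid in enumerate(ranked_pids[:k], start=1):
            if pid in relevant_pids:
                total += 1.0 / rank
                break  # only the first relevant passage counts
    return total / len(rankings)


# Toy data: q1 has its relevant passage at rank 2, q2 has none in the list
rankings = {"q1": ["p3", "p1", "p2"], "q2": ["p9", "p8"]}
relevant = {"q1": {"p1"}, "q2": {"p7"}}
score = mrr_at_k(rankings, relevant)  # (1/2 + 0) / 2 = 0.25
```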
-
Dataset com.microsoft.msmarco.passage.eval_withrun
-
Dataset com.microsoft.msmarco.passage.dev_small_queries
datamaestro_text.data.ir.csv.Topics
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.dev_small_qrels
datamaestro_text.data.ir.trec.TrecAdhocAssessments
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
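TREC-style relevance assessments are usually stored as whitespace-separated lines of the form `qid iteration docno relevance`. The following sketch parses such lines into a nested dictionary; `parse_qrels` is a hypothetical helper written for illustration, not the loader used by `TrecAdhocAssessments`.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map qid -> {docno: relevance} from TREC-format qrels lines."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, docno, rel = fields  # the iteration column is ignored
        qrels[qid][docno] = int(rel)
    return dict(qrels)


# Made-up sample lines
sample = ["q1 0 d10 1", "q1 0 d11 0", "q2 0 d20 1", ""]
qrels = parse_qrels(sample)
```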
-
Dataset com.microsoft.msmarco.passage.dev_small
datamaestro_text.data.ir.Adhoc
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.eval_queries_small
datamaestro_text.data.ir.csv.Topics
External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
-
Dataset com.microsoft.msmarco.passage.trec2019_test_queries
-
Dataset com.microsoft.msmarco.passage.trec2019_test_run
-
Dataset com.microsoft.msmarco.passage.trec2019_test_qrels
-
Dataset com.microsoft.msmarco.passage.trec2019_test
datamaestro_text.data.ir.Adhoc
TREC Deep Learning (2019)
Tasks: information retrieval, passage retrieval
External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html
-
Dataset com.microsoft.msmarco.passage.trec2019_test_withrun
datamaestro_text.data.ir.RerankAdhoc
TREC Deep Learning (2019), including the top-1000 documents to re-rank
Tasks: information retrieval, passage retrieval
External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html
-
Dataset com.microsoft.msmarco.passage.trec2020_test_queries
datamaestro_text.data.ir.csv.Topics
TREC Deep Learning 2020 (topics)
Topics of the TREC 2020 MS-Marco Deep Learning track
-
Dataset com.microsoft.msmarco.passage.trec2020_test_run
datamaestro_text.data.ir.csv.AdhocRunWithText
TREC Deep Learning (2020)
Tags: reranking
Tasks: information retrieval, passage retrieval
External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html
Set of query/passage pairs for the passage re-ranking task (TREC 2020)
Example usage:
from datamaestro import prepare_dataset
from datamaestro.record import IDItem, TextItem
# Load an ad hoc dataset (here, the small dev set listed above)
adhoc = prepare_dataset("com.microsoft.msmarco.passage.dev_small")
# Iterate over documents
for doc in adhoc.documents.iter_documents():
    doc_id = doc[IDItem].id
    text = doc[TextItem].text
# Load training triplets (query, positive passage, negative passage)
triplets = prepare_dataset("com.microsoft.msmarco.passage.train_idtriples")
for triplet in triplets.iter():
    query = triplet.query
    pos_doc = triplet.positive
    neg_doc = triplet.negative
TIPSTER Collections
The TIPSTER document collections used in TREC evaluations, organized by source.
TIPSTER is sometimes also called the Text Research Collection Volume or TREC.
The TIPSTER project was sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections.
The detection data comprises a test collection built at NIST for the TIPSTER project and the related TREC project. Many other information retrieval research groups take part in TREC, working on the same task as the TIPSTER groups but meeting once a year in a workshop to compare results (similar to MUC). The test collection consists of three CD-ROMs of SGML-encoded documents distributed by LDC, plus queries and answers (relevant documents) distributed by NIST.
See also https://trec.nist.gov/data/docs_eng.html and https://trec.nist.gov/data/intro_eng.html
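The SGML-encoded files typically wrap each document in `<DOC>` tags, with the identifier in `<DOCNO>` and the body in one or more `<TEXT>` elements. The sketch below extracts (docno, text) pairs with regular expressions; it assumes that simple layout, ignores all other fields, and uses a made-up sample, so it is only an illustration and not the parser behind `TipsterCollection`.

```python
import re
from typing import Iterator, Tuple

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def iter_sgml_docs(content: str) -> Iterator[Tuple[str, str]]:
    """Yield (docno, text) for each <DOC> block found in the content."""
    for match in DOC_RE.finditer(content):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip blocks without an identifier
        # A document may contain several <TEXT> elements; join them with spaces
        text = " ".join(part.strip() for part in TEXT_RE.findall(body))
        yield docno.group(1), text


# Made-up sample in the assumed layout
sample = """<DOC>
<DOCNO> SAMPLE-0001 </DOCNO>
<TEXT>
First made-up document.
</TEXT>
</DOC>
<DOC>
<DOCNO> SAMPLE-0002 </DOCNO>
<TEXT>
Second made-up document.
</TEXT>
</DOC>
"""
docs = list(iter_sgml_docs(sample))
```

Regular expressions are used here because these files are not well-formed XML in general, so a strict XML parser may reject them.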
-
Dataset gov.nist.trec.tipster.ap88
datamaestro_text.data.ir.trec.TipsterCollection
Associated Press document collection (1988)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.ap89
datamaestro_text.data.ir.trec.TipsterCollection
Associated Press document collection (1989)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.ap90
datamaestro_text.data.ir.trec.TipsterCollection
Associated Press document collection (1990)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.doe1
datamaestro_text.data.ir.trec.TipsterCollection
Department of Energy documents
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.wsj87
datamaestro_text.data.ir.trec.TipsterCollection
Wall Street Journal (1987)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.wsj88
datamaestro_text.data.ir.trec.TipsterCollection
Wall Street Journal (1988)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.wsj89
datamaestro_text.data.ir.trec.TipsterCollection
Wall Street Journal (1989)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.wsj90
datamaestro_text.data.ir.trec.TipsterCollection
Wall Street Journal (1990)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.wsj91
datamaestro_text.data.ir.trec.TipsterCollection
Wall Street Journal (1991)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.wsj92
datamaestro_text.data.ir.trec.TipsterCollection
Wall Street Journal (1992)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.fr88
datamaestro_text.data.ir.trec.TipsterCollection
Federal Register (1988)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.fr89
datamaestro_text.data.ir.trec.TipsterCollection
Federal Register (1989)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.fr94
datamaestro_text.data.ir.trec.TipsterCollection
Federal Register (1994)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.ziff1
datamaestro_text.data.ir.trec.TipsterCollection
Information from the Computer Select disks (1989-90)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.ziff2
datamaestro_text.data.ir.trec.TipsterCollection
Information from the Computer Select disks (1989-90)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.ziff3
datamaestro_text.data.ir.trec.TipsterCollection
Information from the Computer Select disks (1990-91)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.sjm1
datamaestro_text.data.ir.trec.TipsterCollection
San Jose Mercury News (1991)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.cr1
datamaestro_text.data.ir.trec.TipsterCollection
Congressional Record of the 103rd Congress (1993)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.ft1
datamaestro_text.data.ir.trec.TipsterCollection
Financial Times
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.fbis1
datamaestro_text.data.ir.trec.TipsterCollection
Foreign Broadcast Information Service (1996)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
-
Dataset gov.nist.trec.tipster.la8990
datamaestro_text.data.ir.trec.TipsterCollection
Los Angeles Times (1989-90)
External link: https://catalog.ldc.upenn.edu/LDC93T3A
AQUAINT
The AQUAINT Corpus consists of newswire text data in English from three sources: Xinhua News Service, New York Times, and Associated Press.
-
Dataset edu.upenn.ldc.aquaint.apw
datamaestro_text.data.ir.trec.TipsterCollection
Associated Press (1998-2000)
External link: https://catalog.ldc.upenn.edu/LDC2002T31
-
Dataset edu.upenn.ldc.aquaint.nyt
datamaestro_text.data.ir.trec.TipsterCollection
New York Times (1998-2000)
External link: https://catalog.ldc.upenn.edu/LDC2002T31
-
Dataset edu.upenn.ldc.aquaint.xie
datamaestro_text.data.ir.trec.TipsterCollection
Xinhua News Agency newswires (1996-2000)
External link: https://catalog.ldc.upenn.edu/LDC2002T31
-
Dataset edu.upenn.ldc.aquaint
datamaestro_text.data.ir.trec.TipsterCollection
Aquaint documents
External link: https://catalog.ldc.upenn.edu/LDC2002T31
TREC Ad Hoc
Classic TREC Ad Hoc test collections from NIST. These collections have been fundamental benchmarks in IR research since the 1990s.
See https://trec.nist.gov/data/test_coll.html
-
Dataset gov.nist.trec.adhoc.1.documents
datamaestro_text.data.ir.trec.TipsterCollection
TREC-1 to TREC-3 documents (TIPSTER volumes 1 and 2)
-
Dataset gov.nist.trec.adhoc.1.topics
-
Dataset gov.nist.trec.adhoc.1.assessments
-
Dataset gov.nist.trec.adhoc.1
datamaestro_text.data.ir.Adhoc
Ad-hoc task of TREC 1 (1992)
-
Dataset gov.nist.trec.adhoc.2.topics
-
Dataset gov.nist.trec.adhoc.2.assessments
-
Dataset gov.nist.trec.adhoc.2
datamaestro_text.data.ir.Adhoc
Ad-hoc task of TREC 2 (1993)
-
Dataset gov.nist.trec.adhoc.3.topics
-
Dataset gov.nist.trec.adhoc.3.assessments
-
Dataset gov.nist.trec.adhoc.3
datamaestro_text.data.ir.Adhoc
Ad-hoc task of TREC 3 (1994)
-
Dataset gov.nist.trec.adhoc.4.documents
datamaestro_text.data.ir.trec.TipsterCollection
TREC-4 documents
-
Dataset gov.nist.trec.adhoc.4.topics
-
Dataset gov.nist.trec.adhoc.4.assessments
-
Dataset gov.nist.trec.adhoc.4
datamaestro_text.data.ir.Adhoc
Ad-hoc task of TREC 4 (1995)
-
Dataset gov.nist.trec.adhoc.5.documents
datamaestro_text.data.ir.trec.TipsterCollection
TREC-5 documents
-
Dataset gov.nist.trec.adhoc.5.topics
-
Dataset gov.nist.trec.adhoc.5.qrels
-
Dataset gov.nist.trec.adhoc.5
datamaestro_text.data.ir.Adhoc
Ad-hoc task of TREC 5 (1996)
-
Dataset gov.nist.trec.adhoc.6.documents
datamaestro_text.data.ir.trec.TipsterCollection
TREC-6 documents
-
Dataset gov.nist.trec.adhoc.6.topics
-
Dataset gov.nist.trec.adhoc.6.qrels
-
Dataset gov.nist.trec.adhoc.6
datamaestro_text.data.ir.Adhoc
Ad-hoc task of TREC 6 (1997)
-
Dataset gov.nist.trec.adhoc.7.documents
datamaestro_text.data.ir.trec.TipsterCollection
TREC-7 documents
-
Dataset gov.nist.trec.adhoc.7.topics
-
Dataset gov.nist.trec.adhoc.7.qrels
-
Dataset gov.nist.trec.adhoc.7
datamaestro_text.data.ir.Adhoc
Ad-hoc task of TREC 7 (1998)
-
Dataset gov.nist.trec.adhoc.8.topics
-
Dataset gov.nist.trec.adhoc.8.qrels
-
Dataset gov.nist.trec.adhoc.8
datamaestro_text.data.ir.Adhoc
Ad-hoc task of TREC 8 (1999)
-
Dataset gov.nist.trec.adhoc.robust.2004.topics
-
Dataset gov.nist.trec.adhoc.robust.2004.qrels
-
Dataset gov.nist.trec.adhoc.robust.2004
datamaestro_text.data.ir.Adhoc
Ad-hoc task of TREC Robust (2004)
-
Dataset gov.nist.trec.adhoc.robust.2005.topics
-
Dataset gov.nist.trec.adhoc.robust.2005.qrels
-
Dataset gov.nist.trec.adhoc.robust.2005
datamaestro_text.data.ir.Adhoc
Ad-hoc task of TREC Robust (2005)
Example usage:
from datamaestro import prepare_dataset
# Load TREC Adhoc dataset (e.g., TREC-8)
adhoc = prepare_dataset("gov.nist.trec.adhoc.8")
# Access components
documents = adhoc.documents
topics = adhoc.topics
assessments = adhoc.assessments