Information Retrieval Datasets

This section lists native IR dataset definitions. For access to hundreds more IR datasets, see IR-Datasets Integration.

MS MARCO Passage

The MS MARCO (Microsoft Machine Reading Comprehension) Passage Ranking dataset. One of the most widely used benchmarks for neural IR research.

Contains ~8.8M passages and ~500K training queries with sparse relevance judgments.

MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset focused on machine reading comprehension, question answering, and passage ranking. A variant of this task was part of TREC and AFIRM 2019. For updates about TREC 2019, please follow the project repository.

Passage reranking task: given a query q and the 1000 most relevant passages P = p1, p2, p3, …, p1000, as retrieved by BM25, a successful system is expected to rerank the most relevant passage as high as possible. Note that the 1000 candidate passages for a query do not always include a human-labeled relevant passage. Evaluation is done using MRR.

Publication: Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In CoCo@NIPS.

See [https://github.com/microsoft/MSMARCO-Passage-Ranking](https://github.com/microsoft/MSMARCO-Passage-Ranking) for more details.
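
As a rough illustration of the MRR metric mentioned above, here is a minimal sketch that does not depend on datamaestro. The function name `mean_reciprocal_rank`, the `cutoff` parameter, and the toy IDs are assumptions made for this example. The official MS MARCO passage ranking evaluation reports MRR at cutoff 10; this sketch averages over the queries present in `rankings`, which may differ from what an official evaluation script does.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    cutoff: int = 10,
) -> float:
    """Average over queries of 1/rank of the first relevant passage.

    Queries with no relevant passage in the top `cutoff` contribute 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked_pids in rankings.items():
        positives = relevant.get(qid, set())
        for rank, pid in enumerate(ranked_pids[:cutoff], start=1):
            if pid in positives:
                total += 1.0 / rank
                break  # only the first relevant passage counts
    return total / len(rankings)


# Toy data with hypothetical query and passage IDs
rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p4"], "q3": ["p5"]}
relevant = {"q1": {"p1"}, "q2": {"p9"}, "q3": {"p8"}}
mrr = mean_reciprocal_rank(rankings, relevant)
```

Here q1 contributes 1/2, q2 contributes 1, and q3 contributes 0, so the average over three queries is 0.5.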

Dataset com.microsoft.msmarco.passage.collection.etc

datamaestro.data.Folder

Documents and some more files

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.collection

datamaestro_text.data.ir.csv.Documents

MS-Marco documents

This file contains each passage in the larger MSMARCO dataset.

Format is TSV (PID \t Passage)
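
To make the layout concrete, the sketch below reads such a two-column TSV with the Python standard library only. It is not how datamaestro_text loads the file; the helper name `iter_passages` and the sample passages are invented for illustration.

```python
import csv
import io
from typing import Iterable, Iterator, Tuple


def iter_passages(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (pid, passage) pairs from tab-separated lines."""
    # QUOTE_NONE keeps quotation marks inside passages as literal text
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip malformed lines
        pid, passage = row
        yield pid, passage


# Hypothetical sample content standing in for the real collection file
sample = io.StringIO('0\tA first toy passage.\n1\tA "quoted" toy passage.\n')
passages = list(iter_passages(sample))
```

With a real file, the same function can be given an open text file handle instead of the `io.StringIO` sample.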

Dataset com.microsoft.msmarco.passage.train.run

datamaestro_text.data.ir.csv.AdhocRunWithText

TSV format: qid, pid, query, passage
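
A run file in this layout repeats the query text on every line, so a re-ranker typically groups lines by query first. The following is a minimal standard-library sketch of that grouping, assuming tab-separated fields in the order given above; `group_candidates` and the sample lines are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_candidates(
    lines: Iterable[str],
) -> Tuple[Dict[str, str], Dict[str, List[Tuple[str, str]]]]:
    """Return (queries, candidates) where candidates maps qid -> [(pid, passage)]."""
    queries: Dict[str, str] = {}
    candidates: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first three tabs only; the passage is the remainder
        qid, pid, query, passage = line.split("\t", 3)
        queries[qid] = query
        candidates[qid].append((pid, passage))
    return queries, dict(candidates)


# Hypothetical sample lines
sample = [
    "q1\tp10\twhat is bm25\tBM25 is a ranking function.\n",
    "q1\tp11\twhat is bm25\tAn unrelated passage.\n",
    "q2\tp20\tdefine mrr\tMRR stands for mean reciprocal rank.\n",
]
queries, candidates = group_candidates(sample)
```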

Dataset com.microsoft.msmarco.passage.train.queries

datamaestro_text.data.ir.csv.Topics

Dataset com.microsoft.msmarco.passage.train.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset com.microsoft.msmarco.passage.train

datamaestro_text.data.ir.Adhoc

MS-Marco train dataset

Tasks: information retrieval, passage retrieval

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.train.withrun

datamaestro_text.data.ir.RerankAdhoc

MS-Marco train dataset, including the top-1000 documents to re-rank

Tasks: information retrieval, passage retrieval

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.train.idtriples

datamaestro_text.data.ir.TrainingTripletsLines

Full training triples (query, positive passage, negative passage) with IDs

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
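
The sketch below shows one way ID-based triples could be read, assuming each line holds three whitespace-separated fields in the order query ID, positive passage ID, negative passage ID. That layout, the `IdTriple` type, and the sample IDs are assumptions for illustration; check the downloaded file before relying on them.

```python
from typing import Iterable, Iterator, NamedTuple


class IdTriple(NamedTuple):
    qid: str
    positive_pid: str
    negative_pid: str


def iter_id_triples(lines: Iterable[str]) -> Iterator[IdTriple]:
    """Yield one IdTriple per well-formed line, skipping anything else."""
    for line in lines:
        fields = line.split()
        if len(fields) == 3:
            yield IdTriple(*fields)


# Hypothetical sample lines
sample = ["1185869\t0\t5\n", "\n", "1185868\t16\t42\n"]
triples = list(iter_id_triples(sample))
```

Resolving the IDs to text then requires a lookup in the queries and passage collection, which is why the text-based triple files are much larger.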

Dataset com.microsoft.msmarco.passage.train.texttriples.small

datamaestro_text.data.ir.TrainingTripletsLines

Small training triples (query, positive passage, negative passage) with text

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.train.texttriple.full

datamaestro_text.data.ir.TrainingTripletsLines

Full training triples (query, positive passage, negative passage) with text

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.dev.queries

datamaestro_text.data.ir.csv.Topics

Dataset com.microsoft.msmarco.passage.dev.run

datamaestro_text.data.ir.csv.AdhocRunWithText

Dataset com.microsoft.msmarco.passage.dev.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset com.microsoft.msmarco.passage.dev

datamaestro_text.data.ir.Adhoc

MS-Marco dev dataset

Tasks: information retrieval, passage retrieval

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.dev.withrun

datamaestro_text.data.ir.RerankAdhoc

MS-Marco dev dataset, including the top-1000 documents to re-rank

Tasks: information retrieval, passage retrieval

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.eval.withrun

datamaestro_text.data.ir.csv.AdhocRunWithText

Dataset com.microsoft.msmarco.passage.dev.small.queries

datamaestro_text.data.ir.csv.Topics

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.dev.small.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.dev.small

datamaestro_text.data.ir.Adhoc

Tasks: information retrieval

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.eval.queries.small

datamaestro_text.data.ir.csv.Topics

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.trec2019.test.queries

datamaestro_text.data.ir.csv.Topics

Dataset com.microsoft.msmarco.passage.trec2019.test.run

datamaestro_text.data.ir.csv.AdhocRunWithText

Dataset com.microsoft.msmarco.passage.trec2019.test.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset com.microsoft.msmarco.passage.trec2019.test

datamaestro_text.data.ir.Adhoc

TREC Deep Learning (2019)

Tasks: information retrieval, passage retrieval

External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html

Dataset com.microsoft.msmarco.passage.trec2019.test.withrun

datamaestro_text.data.ir.RerankAdhoc

TREC Deep Learning (2019), including the top-1000 documents to re-rank

Tasks: information retrieval, passage retrieval

External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html

Dataset com.microsoft.msmarco.passage.trec2020.test.queries

datamaestro_text.data.ir.csv.Topics

TREC Deep Learning 2020 (topics)

Topics of the TREC 2020 MS-Marco Deep Learning track

Dataset com.microsoft.msmarco.passage.trec2020.test.run

datamaestro_text.data.ir.csv.AdhocRunWithText

TREC Deep Learning (2020)

Tags: reranking

Tasks: information retrieval, passage retrieval

External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html

Set of query/passage pairs for the passage re-ranking task (TREC 2020)

Example usage:

from datamaestro import prepare_dataset
from datamaestro.record import IDItem, TextItem

# Load an ad hoc dataset (here the small dev split listed above)
adhoc = prepare_dataset("com.microsoft.msmarco.passage.dev.small")

# Iterate over documents
for doc in adhoc.documents.iter_documents():
    doc_id = doc[IDItem].id
    text = doc[TextItem].text

# Load training triplets (the ID-based triples listed above)
triplets = prepare_dataset("com.microsoft.msmarco.passage.train.idtriples")
for triplet in triplets.iter():
    query = triplet.query
    pos_doc = triplet.positive
    neg_doc = triplet.negative

TIPSTER Collections

The TIPSTER document collections used in TREC evaluations, organized by source.

gov.nist.trec.tipster

TIPSTER is sometimes also called the Text Research Collection Volume or TREC.

The TIPSTER project was sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections.

The detection data comprises a test collection built at NIST for the TIPSTER project and the related TREC project. The TREC project has many other participating information retrieval research groups, working on the same task as the TIPSTER groups, but meeting once a year in a workshop to compare results (similar to MUC). The test collection consists of three CD-ROMs of SGML-encoded documents distributed by LDC, plus queries and answers (relevant documents) distributed by NIST.

See also https://trec.nist.gov/data/docs_eng.html and https://trec.nist.gov/data/intro_eng.html
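
For readers unfamiliar with the SGML encoding mentioned above, here is a minimal sketch of extracting document IDs and text with regular expressions. It assumes the common `<DOC>`, `<DOCNO>`, and `<TEXT>` tags; individual sources use additional or different tags, and this is not the parser used by `TipsterCollection`. The helper name and the sample documents are invented.

```python
import re
from typing import Iterator, Tuple

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def iter_sgml_documents(content: str) -> Iterator[Tuple[str, str]]:
    """Yield (docno, text) for each <DOC> element that has a <DOCNO>."""
    for match in DOC_RE.finditer(content):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip documents without an identifier
        # A document may contain several <TEXT> sections; join them
        text = " ".join(part.strip() for part in TEXT_RE.findall(body))
        yield docno.group(1), text


# Hypothetical sample in the assumed layout
sample = """<DOC>
<DOCNO> AP880212-0001 </DOCNO>
<TEXT>
First toy document.
</TEXT>
</DOC>
<DOC>
<DOCNO> AP880212-0002 </DOCNO>
<TEXT>
Second toy document.
</TEXT>
</DOC>
"""
docs = list(iter_sgml_documents(sample))
```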

Dataset gov.nist.trec.tipster.ap88

datamaestro_text.data.ir.trec.TipsterCollection

Associated Press document collection (1988)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ap89

datamaestro_text.data.ir.trec.TipsterCollection

Associated Press document collection (1989)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ap90

datamaestro_text.data.ir.trec.TipsterCollection

Associated Press document collection (1990)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.doe1

datamaestro_text.data.ir.trec.TipsterCollection

Department of Energy documents

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj87

datamaestro_text.data.ir.trec.TipsterCollection

Wall Street Journal (1987)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj88

datamaestro_text.data.ir.trec.TipsterCollection

Wall Street Journal (1988)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj89

datamaestro_text.data.ir.trec.TipsterCollection

Wall Street Journal (1989)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj90

datamaestro_text.data.ir.trec.TipsterCollection

Wall Street Journal (1990)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj91

datamaestro_text.data.ir.trec.TipsterCollection

Wall Street Journal (1991)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj92

datamaestro_text.data.ir.trec.TipsterCollection

Wall Street Journal (1992)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.fr88

datamaestro_text.data.ir.trec.TipsterCollection

Federal Register (1988)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.fr89

datamaestro_text.data.ir.trec.TipsterCollection

Federal Register (1989)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.fr94

datamaestro_text.data.ir.trec.TipsterCollection

Federal Register (1994)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ziff1

datamaestro_text.data.ir.trec.TipsterCollection

Information from the Computer Select disks (1989-90)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ziff2

datamaestro_text.data.ir.trec.TipsterCollection

Information from the Computer Select disks (1989-90)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ziff3

datamaestro_text.data.ir.trec.TipsterCollection

Information from the Computer Select disks (1990-91)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.sjm1

datamaestro_text.data.ir.trec.TipsterCollection

San Jose Mercury News (1991)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.cr1

datamaestro_text.data.ir.trec.TipsterCollection

Congressional Record (1993)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ft1

datamaestro_text.data.ir.trec.TipsterCollection

Financial Times

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.fbis1

datamaestro_text.data.ir.trec.TipsterCollection

Foreign Broadcast Information Service (1996)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.la8990

datamaestro_text.data.ir.trec.TipsterCollection

Los Angeles Times (1989-90)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

AQUAINT

The AQUAINT Corpus consists of newswire text data in English from three sources: Xinhua News Service, New York Times, and Associated Press.

The AQUAINT Corpus, Linguistic Data Consortium (LDC) catalog number LDC2002T31 and ISBN 1-58563-240-6, consists of newswire text data in English, drawn from three sources: the Xinhua News Service (People’s Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service. It was prepared by the LDC for the AQUAINT Project, and is used in official benchmark evaluations conducted by the National Institute of Standards and Technology (NIST).

Dataset edu.upenn.ldc.aquaint.apw

datamaestro_text.data.ir.trec.TipsterCollection

Associated Press (1998-2000)

External link: https://catalog.ldc.upenn.edu/LDC2002T31

Dataset edu.upenn.ldc.aquaint.nyt

datamaestro_text.data.ir.trec.TipsterCollection

New York Times (1998-2000)

External link: https://catalog.ldc.upenn.edu/LDC2002T31

Dataset edu.upenn.ldc.aquaint.xie

datamaestro_text.data.ir.trec.TipsterCollection

Xinhua News Agency newswires (1996-2000)

External link: https://catalog.ldc.upenn.edu/LDC2002T31

Dataset edu.upenn.ldc.aquaint

datamaestro_text.data.ir.trec.TipsterCollection

Aquaint documents

External link: https://catalog.ldc.upenn.edu/LDC2002T31

TREC Ad Hoc

Classic TREC Ad Hoc test collections from NIST. These collections have been fundamental benchmarks in IR research since the 1990s.

TREC Adhoc datasets and tasks

See [https://trec.nist.gov/data/test_coll.html](https://trec.nist.gov/data/test_coll.html)

Dataset gov.nist.trec.adhoc.1.documents

datamaestro_text.data.ir.trec.TipsterCollection

TREC-1 to TREC-3 documents (TIPSTER volumes 1 and 2)

Dataset gov.nist.trec.adhoc.1.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.1.assessments

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.1

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 1 (1992)

Tasks: information retrieval

Dataset gov.nist.trec.adhoc.2.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.2.assessments

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.2

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 2 (1993)

Tasks: information retrieval

Dataset gov.nist.trec.adhoc.3.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.3.assessments

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.3

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 3 (1994)

Tasks: information retrieval

Dataset gov.nist.trec.adhoc.4.documents

datamaestro_text.data.ir.trec.TipsterCollection

TREC-4 documents

Dataset gov.nist.trec.adhoc.4.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.4.assessments

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.4

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 4 (1995)

Tasks: information retrieval

Dataset gov.nist.trec.adhoc.5.documents

datamaestro_text.data.ir.trec.TipsterCollection

TREC-5 documents

Dataset gov.nist.trec.adhoc.5.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.5.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.5

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 5 (1996)

Tasks: information retrieval

Dataset gov.nist.trec.adhoc.6.documents

datamaestro_text.data.ir.trec.TipsterCollection

TREC-6 documents

Dataset gov.nist.trec.adhoc.6.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.6.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.6

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 6 (1997)

Tasks: information retrieval

Dataset gov.nist.trec.adhoc.7.documents

datamaestro_text.data.ir.trec.TipsterCollection

TREC-7 documents

Dataset gov.nist.trec.adhoc.7.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.7.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.7

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 7 (1998)

Tasks: information retrieval

Dataset gov.nist.trec.adhoc.8.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.8.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.8

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 8 (1999)

Tasks: information retrieval

Dataset gov.nist.trec.adhoc.robust.2004.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.robust.2004.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.robust.2004

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC Robust (2004)

Tasks: information retrieval

Dataset gov.nist.trec.adhoc.robust.2005.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.robust.2005.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.robust.2005

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC Robust (2005)

Tasks: information retrieval

Example usage:

from datamaestro import prepare_dataset

# Load TREC Adhoc dataset (e.g., TREC-8)
adhoc = prepare_dataset("gov.nist.trec.adhoc.8")

# Access components
documents = adhoc.documents
topics = adhoc.topics
assessments = adhoc.assessments
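
The relevance assessments in these collections follow the usual TREC qrels layout: four whitespace-separated fields per line (topic, iteration, document number, relevance). As a minimal standard-library sketch, independent of `TrecAdhocAssessments`, the lines can be read into a nested dictionary. The function name and the sample judgments are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map topic -> {docno: relevance} from TREC-style qrels lines."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, relevance = fields
        qrels[topic][docno] = int(relevance)
    return dict(qrels)


# Hypothetical sample judgments
sample = [
    "401 0 DOC-0001 0",
    "401 0 DOC-0002 1",
    "402 0 DOC-0003 1",
]
qrels = parse_qrels(sample)
relevant_401 = [d for d, rel in qrels["401"].items() if rel > 0]
```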