Information Retrieval Datasets

This section lists native IR dataset definitions. For access to hundreds more IR datasets, see IR-Datasets Integration.

MS MARCO Passage

The MS MARCO (Microsoft Machine Reading Comprehension) Passage Ranking dataset is one of the most widely used benchmarks for neural IR research.

It contains ~8.8M passages and ~500K training queries with sparse relevance judgments.

Publication: Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In CoCo@NIPS.

See https://github.com/microsoft/MSMARCO-Passage-Ranking for more details.

Dataset com.microsoft.msmarco.passage.collection_etc

datamaestro.data.Folder

Documents and some more files

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.collection

datamaestro_text.data.ir.csv.Documents

MS-Marco documents

This file contains each passage in the larger MSMARCO dataset.

Format is TSV (PID \t Passage)
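
As a rough illustration of the two-column layout described above, the following sketch reads (PID, passage) pairs with the standard library only. The function name iter_passages and the two sample lines are hypothetical; they are not part of datamaestro_text, and the real collection file is far larger.

```python
import csv
import io
from typing import Iterator, Tuple


def iter_passages(stream) -> Iterator[Tuple[str, str]]:
    """Yield (pid, passage) pairs from a tab-separated stream."""
    # QUOTE_NONE: quote characters inside passages are plain text, not CSV quoting
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip malformed lines instead of failing
        pid, passage = row
        yield pid, passage


# Hypothetical two-line sample standing in for the real collection file
sample = io.StringIO(
    '0\tFirst passage text.\n'
    '1\tA passage with a "quoted" word.\n'
)
passages = list(iter_passages(sample))
```

For the real file, the same function could be given an open file handle (opened with newline="" and UTF-8 encoding) instead of the in-memory sample.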

Dataset com.microsoft.msmarco.passage.train_run

datamaestro_text.data.ir.csv.AdhocRunWithText

TSV format: qid, pid, query, passage
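
A minimal sketch of how such a four-column file might be grouped by query, assuming one (qid, pid, query, passage) row per line. The helper load_run_with_text and the sample rows are illustrative only and are not the AdhocRunWithText API.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple


def load_run_with_text(stream) -> Tuple[Dict[str, str], Dict[str, List[Tuple[str, str]]]]:
    """Group a qid/pid/query/passage TSV stream by query ID."""
    queries: Dict[str, str] = {}
    candidates: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # ignore lines that do not have exactly four columns
        qid, pid, query, passage = row
        queries[qid] = query
        candidates[qid].append((pid, passage))
    return queries, dict(candidates)


# Hypothetical rows; real IDs and texts come from the downloaded file
sample = io.StringIO(
    "q1\tp1\twhat is ir\tIR is about finding documents.\n"
    "q1\tp2\twhat is ir\tAn unrelated passage.\n"
    "q2\tp3\tsecond query\tAnother passage.\n"
)
queries, candidates = load_run_with_text(sample)
```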

Dataset com.microsoft.msmarco.passage.train_queries

datamaestro_text.data.ir.csv.Topics

Dataset com.microsoft.msmarco.passage.train_qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset com.microsoft.msmarco.passage.train

datamaestro_text.data.ir.Adhoc

MS-Marco train dataset

Tasks: information retrieval, passage retrieval

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.train_withrun

datamaestro_text.data.ir.RerankAdhoc

MSMarco train dataset, including the top-1000 documents to re-rank

Tasks: information retrieval, passage retrieval

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
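
To make the re-ranking setting concrete, here is a toy sketch in which each query comes with a fixed candidate list that is re-ordered by a scoring function. The term-overlap scorer is a stand-in chosen for brevity, not a method used by MS MARCO or datamaestro_text, and all names and texts are hypothetical.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, passage: str) -> float:
    """Fraction of query terms that also occur in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def rerank(
    query: str,
    candidates: List[Tuple[str, str]],
    score_fn: Callable[[str, str], float] = overlap_score,
) -> List[Tuple[str, float]]:
    """Return (pid, score) pairs sorted by decreasing score."""
    scored = [(pid, score_fn(query, text)) for pid, text in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidate list for a single query
candidates = [
    ("p1", "Paris is the capital of France"),
    ("p2", "Bananas are rich in potassium"),
    ("p3", "France has many cities"),
]
ranking = rerank("capital of france", candidates)
```

In practice score_fn would be replaced by a trained model; the surrounding loop stays the same.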

Dataset com.microsoft.msmarco.passage.train_idtriples

datamaestro_text.data.ir.TrainingTripletsLines

Full training triples (query, positive passage, negative passage) with IDs

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking
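
The sketch below shows one way ID-based triples could be read, assuming each line holds a query ID, a positive passage ID, and a negative passage ID separated by tabs. That layout, the IDTriple type, and the sample IDs are assumptions for illustration; check the downloaded file before relying on them.

```python
import io
from typing import Iterator, NamedTuple


class IDTriple(NamedTuple):
    qid: str
    positive_pid: str
    negative_pid: str


def iter_id_triples(stream) -> Iterator[IDTriple]:
    """Yield one IDTriple per non-empty line of a tab-separated stream."""
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # assumed layout is exactly three columns
        yield IDTriple(*fields)


# Hypothetical IDs, not taken from the actual dataset
sample = io.StringIO("10\t100\t200\n10\t100\t201\n11\t300\t400\n")
triples = list(iter_id_triples(sample))
```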

Dataset com.microsoft.msmarco.passage.train_texttriples_small

datamaestro_text.data.ir.TrainingTripletsLines

Small training triples (query, positive passage, negative passage) with text

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.train_texttriple_full

datamaestro_text.data.ir.TrainingTripletsLines

Full training triples (query, positive passage, negative passage) with text

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.dev_queries

datamaestro_text.data.ir.csv.Topics

Dataset com.microsoft.msmarco.passage.dev_run

datamaestro_text.data.ir.csv.AdhocRunWithText

Dataset com.microsoft.msmarco.passage.dev_qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset com.microsoft.msmarco.passage.dev

datamaestro_text.data.ir.Adhoc

MS-Marco dev dataset

Tasks: information retrieval, passage retrieval

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.dev_withrun

datamaestro_text.data.ir.RerankAdhoc

MSMarco dev dataset, including the top-1000 documents to re-rank

Tasks: information retrieval, passage retrieval

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.eval_withrun

datamaestro_text.data.ir.csv.AdhocRunWithText

Dataset com.microsoft.msmarco.passage.dev_small_queries

datamaestro_text.data.ir.csv.Topics

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.dev_small_qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.dev_small

datamaestro_text.data.ir.Adhoc

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.eval_queries_small

datamaestro_text.data.ir.csv.Topics

External link: https://github.com/microsoft/MSMARCO-Passage-Ranking

Dataset com.microsoft.msmarco.passage.trec2019_test_queries

datamaestro_text.data.ir.csv.Topics

Dataset com.microsoft.msmarco.passage.trec2019_test_run

datamaestro_text.data.ir.csv.AdhocRunWithText

Dataset com.microsoft.msmarco.passage.trec2019_test_qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset com.microsoft.msmarco.passage.trec2019_test

datamaestro_text.data.ir.Adhoc

TREC Deep Learning (2019)

Tasks: information retrieval, passage retrieval

External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html

Dataset com.microsoft.msmarco.passage.trec2019_test_withrun

datamaestro_text.data.ir.RerankAdhoc

TREC Deep Learning (2019), including the top-1000 documents to re-rank

Tasks: information retrieval, passage retrieval

External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html

Dataset com.microsoft.msmarco.passage.trec2020_test_queries

datamaestro_text.data.ir.csv.Topics

TREC Deep Learning 2020 (topics)

Topics of the TREC 2020 MS-Marco Deep Learning track

Dataset com.microsoft.msmarco.passage.trec2020_test_run

datamaestro_text.data.ir.csv.AdhocRunWithText

TREC Deep Learning (2020)

Tags: reranking

Tasks: information retrieval, passage retrieval

External link: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html

Set of query/passage pairs for the passage re-ranking task (TREC 2020)

Example usage:

from datamaestro import prepare_dataset
from datamaestro.record import IDItem, TextItem

# Load an ad hoc dataset (here, the small dev set listed above)
adhoc = prepare_dataset("com.microsoft.msmarco.passage.dev_small")

# Iterate over documents
for doc in adhoc.documents.iter_documents():
    doc_id = doc[IDItem].id
    text = doc[TextItem].text

# Load training triplets (the ID-only variant listed above)
triplets = prepare_dataset("com.microsoft.msmarco.passage.train_idtriples")
for triplet in triplets.iter():
    query = triplet.query
    pos_doc = triplet.positive
    neg_doc = triplet.negative

TIPSTER Collections

The TIPSTER document collections used in TREC evaluations, organized by source.

TIPSTER is sometimes also called the Text Research Collection Volume or TREC.

The TIPSTER project was sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections.

The detection data consists of a test collection built at NIST for the TIPSTER project and the related TREC project. The TREC project has many other participating information retrieval research groups, working on the same task as the TIPSTER groups, but meeting once a year in a workshop to compare results (similar to MUC). The test collection consists of three CD-ROMs of SGML-encoded documents distributed by LDC, plus queries and answers (relevant documents) distributed by NIST.
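
As an illustration of what SGML-encoded documents look like to a program, the sketch below extracts document IDs and text from a string using the DOC, DOCNO, and TEXT tags commonly found in TREC collections. Individual sources use additional or different tags, so this is a simplification; the function name and the sample documents are hypothetical.

```python
import re
from typing import List, Tuple

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def parse_trec_sgml(content: str) -> List[Tuple[str, str]]:
    """Return (docno, text) pairs for each <DOC> element in the content."""
    docs = []
    for match in DOC_RE.finditer(content):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # a document without an ID cannot be judged, so skip it
        # Some documents carry several <TEXT> sections; join them with spaces
        text = " ".join(part.strip() for part in TEXT_RE.findall(body))
        docs.append((docno.group(1), text))
    return docs


# Hypothetical sample in the usual TREC layout
sample = """<DOC>
<DOCNO> XX000000-0001 </DOCNO>
<TEXT>
First document body.
</TEXT>
</DOC>
<DOC>
<DOCNO> XX000000-0002 </DOCNO>
<TEXT>
Second document body.
</TEXT>
</DOC>
"""
documents = parse_trec_sgml(sample)
```

A regular expression is enough for a sketch like this because the files are not guaranteed to be well-formed XML; a production reader would need per-collection handling.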

See also https://trec.nist.gov/data/docs_eng.html and https://trec.nist.gov/data/intro_eng.html

Dataset gov.nist.trec.tipster.ap88

datamaestro_text.data.ir.trec.TipsterCollection

Associated Press document collection (1988)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ap89

datamaestro_text.data.ir.trec.TipsterCollection

Associated Press document collection (1989)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ap90

datamaestro_text.data.ir.trec.TipsterCollection

Associated Press document collection (1990)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.doe1

datamaestro_text.data.ir.trec.TipsterCollection

Department of Energy documents

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj87

datamaestro_text.data.ir.trec.TipsterCollection

Wall Street Journal (1987)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj88

datamaestro_text.data.ir.trec.TipsterCollection

Wall Street Journal (1988)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj89

datamaestro_text.data.ir.trec.TipsterCollection

Wall Street Journal (1989)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj90

datamaestro_text.data.ir.trec.TipsterCollection

Wall Street Journal (1990)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj91

datamaestro_text.data.ir.trec.TipsterCollection

Wall Street Journal (1991)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.wsj92

datamaestro_text.data.ir.trec.TipsterCollection

Wall Street Journal (1992)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.fr88

datamaestro_text.data.ir.trec.TipsterCollection

Federal Register (1988)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.fr89

datamaestro_text.data.ir.trec.TipsterCollection

Federal Register (1989)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.fr94

datamaestro_text.data.ir.trec.TipsterCollection

Federal Register (1994)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ziff1

datamaestro_text.data.ir.trec.TipsterCollection

Information from the Computer Select disks (1989-90)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ziff2

datamaestro_text.data.ir.trec.TipsterCollection

Information from the Computer Select disks (1989-90)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ziff3

datamaestro_text.data.ir.trec.TipsterCollection

Information from the Computer Select disks (1990-91)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.sjm1

datamaestro_text.data.ir.trec.TipsterCollection

San Jose Mercury News (1991)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.cr1

datamaestro_text.data.ir.trec.TipsterCollection

Congressional Record (1993)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.ft1

datamaestro_text.data.ir.trec.TipsterCollection

Financial Times

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.fbis1

datamaestro_text.data.ir.trec.TipsterCollection

Foreign Broadcast Information Service (1996)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

Dataset gov.nist.trec.tipster.la8990

datamaestro_text.data.ir.trec.TipsterCollection

Los Angeles Times (1989-90)

External link: https://catalog.ldc.upenn.edu/LDC93T3A

AQUAINT

The AQUAINT Corpus consists of newswire text data in English from three sources: Xinhua News Service, New York Times, and Associated Press.

Dataset edu.upenn.ldc.aquaint.apw

datamaestro_text.data.ir.trec.TipsterCollection

Associated Press (1998-2000)

External link: https://catalog.ldc.upenn.edu/LDC2002T31

Dataset edu.upenn.ldc.aquaint.nyt

datamaestro_text.data.ir.trec.TipsterCollection

New York Times (1998-2000)

External link: https://catalog.ldc.upenn.edu/LDC2002T31

Dataset edu.upenn.ldc.aquaint.xie

datamaestro_text.data.ir.trec.TipsterCollection

Xinhua News Agency newswires (1996-2000)

External link: https://catalog.ldc.upenn.edu/LDC2002T31

Dataset edu.upenn.ldc.aquaint

datamaestro_text.data.ir.trec.TipsterCollection

Aquaint documents

External link: https://catalog.ldc.upenn.edu/LDC2002T31

TREC Ad Hoc

Classic TREC Ad Hoc test collections from NIST. These collections have been fundamental benchmarks in IR research since the 1990s.

See https://trec.nist.gov/data/test_coll.html
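
Relevance assessments for these collections are distributed in the standard TREC qrels layout: four whitespace-separated fields per line (topic, iteration, document number, relevance). The sketch below parses that layout into a nested dictionary; the function name and the sample lines are hypothetical and independent of TrecAdhocAssessments.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map each topic ID to a {docno: relevance} dictionary."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, relevance = fields
        qrels[topic][docno] = int(relevance)
    return dict(qrels)


# Hypothetical judgments; document numbers are placeholders
sample = [
    "401 0 DOC-A 1",
    "401 0 DOC-B 0",
    "402 0 DOC-C 1",
]
qrels = parse_qrels(sample)
relevant_401 = [d for d, rel in qrels["401"].items() if rel > 0]
```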

Dataset gov.nist.trec.adhoc.1.documents

datamaestro_text.data.ir.trec.TipsterCollection

TREC-1 to TREC-3 documents (TIPSTER volumes 1 and 2)

Dataset gov.nist.trec.adhoc.1.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.1.assessments

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.1

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 1 (1992)

Dataset gov.nist.trec.adhoc.2.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.2.assessments

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.2

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 2 (1993)

Dataset gov.nist.trec.adhoc.3.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.3.assessments

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.3

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 3 (1994)

Dataset gov.nist.trec.adhoc.4.documents

datamaestro_text.data.ir.trec.TipsterCollection

TREC-4 documents

Dataset gov.nist.trec.adhoc.4.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.4.assessments

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.4

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 4 (1995)

Dataset gov.nist.trec.adhoc.5.documents

datamaestro_text.data.ir.trec.TipsterCollection

TREC-5 documents

Dataset gov.nist.trec.adhoc.5.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.5.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.5

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 5 (1996)

Dataset gov.nist.trec.adhoc.6.documents

datamaestro_text.data.ir.trec.TipsterCollection

TREC-6 documents

Dataset gov.nist.trec.adhoc.6.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.6.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.6

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 6 (1997)

Dataset gov.nist.trec.adhoc.7.documents

datamaestro_text.data.ir.trec.TipsterCollection

TREC-7 documents

Dataset gov.nist.trec.adhoc.7.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.7.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.7

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 7 (1998)

Dataset gov.nist.trec.adhoc.8.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.8.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.8

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC 8 (1999)

Dataset gov.nist.trec.adhoc.robust.2004.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.robust.2004.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.robust.2004

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC Robust (2004)

Dataset gov.nist.trec.adhoc.robust.2005.topics

datamaestro_text.data.ir.trec.TrecTopics

Dataset gov.nist.trec.adhoc.robust.2005.qrels

datamaestro_text.data.ir.trec.TrecAdhocAssessments

Dataset gov.nist.trec.adhoc.robust.2005

datamaestro_text.data.ir.Adhoc

Ad-hoc task of TREC Robust (2005)

Example usage:

from datamaestro import prepare_dataset

# Load TREC Adhoc dataset (e.g., TREC-8)
adhoc = prepare_dataset("gov.nist.trec.adhoc.8")

# Access components
documents = adhoc.documents
topics = adhoc.topics
assessments = adhoc.assessments