IR-Datasets Integration

"ANTIQUE is a non-factoid question answering dataset based on the questions and answers of Yahoo! Webscope L6."

Documents: Short answer passages (from Yahoo Answers)
Queries: Natural language questions (from Yahoo Answers)
Dataset Paper

Dataset irds.antique.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official test set of the ANTIQUE dataset.

Dataset irds.antique.test.qrels

Official test set of the ANTIQUE dataset.

Dataset irds.antique.test

Official test set of the ANTIQUE dataset.

Dataset irds.antique.test.non-offensive.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

antique/test without a set of queries deemed by the authors of ANTIQUE to be "offensive (and noisy)."

Dataset irds.antique.test.non-offensive.qrels

antique/test without a set of queries deemed by the authors of ANTIQUE to be "offensive (and noisy)."

Dataset irds.antique.test.non-offensive

antique/test without a set of queries deemed by the authors of ANTIQUE to be "offensive (and noisy)."
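The descriptions on this page refer to datasets by their ir_datasets path (for example antique/test), while the entries themselves use dotted identifiers (for example irds.antique.test.qrels). The following is a minimal sketch of that naming pattern as it can be inferred from this page. The helper name `irds_to_datamaestro_id` is hypothetical and is not part of datamaestro or ir_datasets.

```python
from typing import Optional

# Entry suffixes that appear on this page.
KNOWN_PARTS = {"documents", "queries", "qrels"}


def irds_to_datamaestro_id(irds_id: str, part: Optional[str] = None) -> str:
    """Turn an ir_datasets path into the dotted identifier style used here.

    Assumed convention (inferred from this page, not from the library):
    prefix with "irds.", replace "/" with ".", optionally append a part.
    """
    if part is not None and part not in KNOWN_PARTS:
        raise ValueError(f"unknown part: {part!r}")
    base = "irds." + irds_id.strip("/").replace("/", ".")
    return base if part is None else f"{base}.{part}"


print(irds_to_datamaestro_id("antique/test/non-offensive"))
# irds.antique.test.non-offensive
print(irds_to_datamaestro_id("antique/test", "qrels"))
# irds.antique.test.qrels
```

The reverse direction is not safe to automate in the same way, because some path components contain dots themselves (for example argsme/1.0-cleaned).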

Dataset irds.antique.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official train set of the ANTIQUE dataset.

Dataset irds.antique.train.qrels

Official train set of the ANTIQUE dataset.

Dataset irds.antique.train

Official train set of the ANTIQUE dataset.

Dataset irds.antique.train.split200-train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

antique/train without the 200 queries used by antique/train/split200-valid.

Dataset irds.antique.train.split200-train.qrels

antique/train without the 200 queries used by antique/train/split200-valid.

Dataset irds.antique.train.split200-train

antique/train without the 200 queries used by antique/train/split200-valid.

Dataset irds.antique.train.split200-valid.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A held-out subset of 200 queries from antique/train. Use in conjunction with antique/train/split200-train.

Dataset irds.antique.train.split200-valid.qrels

A held-out subset of 200 queries from antique/train. Use in conjunction with antique/train/split200-train.

Dataset irds.antique.train.split200-valid

A held-out subset of 200 queries from antique/train. Use in conjunction with antique/train/split200-train.
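The relationship between split200-train and split200-valid can be illustrated with a small sketch: a fixed number of queries is held out, and every judgment follows its query into one side or the other. The 200 held-out queries of the real split are fixed by the dataset; the random choice, the function name `split_held_out`, and the toy data below are assumptions made only for illustration.

```python
import random


def split_held_out(qrels, n_valid=200, seed=0):
    """Split (query_id, doc_id, relevance) rows into train and valid parts.

    n_valid distinct queries are held out. Choosing them at random is an
    assumption for this sketch; the real split uses a fixed query list.
    """
    query_ids = sorted({query_id for query_id, _, _ in qrels})
    valid_ids = set(random.Random(seed).sample(query_ids, n_valid))
    train = [row for row in qrels if row[0] not in valid_ids]
    valid = [row for row in qrels if row[0] in valid_ids]
    return train, valid


# Toy judgments: 1000 queries with two judged documents each.
toy_qrels = [(f"q{i}", f"d{i}-{j}", j) for i in range(1000) for j in range(2)]
train, valid = split_held_out(toy_qrels)
print(len({row[0] for row in valid}))  # 200
print(len(train) + len(valid) == len(toy_qrels))  # True
```

Splitting by query rather than by individual judgment keeps all judgments of a query on the same side, which is why the two subsets are meant to be used together.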

AOL-IA (Internet Archive)

This is a version of the AOL Query Log. Documents use versions that appeared around the time of the query log (early 2006) via the Internet Archive.

The query log does not include document or query IDs. These are instead created by ir_datasets. Document IDs are assigned using a hash of the URL that appears in the query log. Query IDs are assigned using a hash of the normalized query. All unique normalized queries are available from queries, and all clicked documents are available from qrels (iteration value set to the user ID). Full information (including the original query) is available from qlogs.
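As a rough illustration of the ID scheme described above, the sketch below derives document and query IDs from hashes and builds a qrels-style record for one click. The hash function (truncated SHA-256), the normalization rule (lowercasing and collapsing whitespace), the relevance value of 1, and all function names are assumptions for this example; the page does not state which ones ir_datasets actually uses.

```python
import hashlib


def normalize_query(query: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace.
    return " ".join(query.lower().split())


def hashed_id(text: str, length: int = 12) -> str:
    # Hypothetical ID scheme: truncated hex digest of the UTF-8 text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def click_to_qrel(user_id: str, query: str, url: str) -> dict:
    """Build a qrels-style record for one clicked URL in the log."""
    return {
        "query_id": hashed_id(normalize_query(query)),
        "doc_id": hashed_id(url),
        "relevance": 1,  # assumption: a click counts as relevant
        "iteration": user_id,  # the user ID goes into the iteration field
    }


a = click_to_qrel("42", "  Cheap   Flights ", "http://example.com/")
b = click_to_qrel("7", "cheap flights", "http://example.com/")
print(a["query_id"] == b["query_id"])  # True
print(a["iteration"], b["iteration"])  # 42 7
```

Because the IDs are derived from content, two users who issue the same normalized query and click the same URL share query and document IDs and differ only in the iteration field.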

Dataset irds.aol-ia.documents

This is a version of the AOL Query Log. Documents use versions that appeared around the time of the query log (early 2006) via the Internet Archive.

The query log does not include document or query IDs. These are instead created by ir_datasets. Document IDs are assigned using a hash of the URL that appears in the query log. Query IDs are assigned using a hash of the normalized query. All unique normalized queries are available from queries, and all clicked documents are available from qrels (iteration value set to the user ID). Full information (including the original query) is available from qlogs.

Dataset irds.aol-ia.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

This is a version of the AOL Query Log. Documents use versions that appeared around the time of the query log (early 2006) via the Internet Archive.

The query log does not include document or query IDs. These are instead created by ir_datasets. Document IDs are assigned using a hash of the URL that appears in the query log. Query IDs are assigned using a hash of the normalized query. All unique normalized queries are available from queries, and all clicked documents are available from qrels (iteration value set to the user ID). Full information (including the original query) is available from qlogs.

Dataset irds.aol-ia.qrels

This is a version of the AOL Query Log. Documents use versions that appeared around the time of the query log (early 2006) via the Internet Archive.

The query log does not include document or query IDs. These are instead created by ir_datasets. Document IDs are assigned using a hash of the URL that appears in the query log. Query IDs are assigned using a hash of the normalized query. All unique normalized queries are available from queries, and all clicked documents are available from qrels (iteration value set to the user ID). Full information (including the original query) is available from qlogs.

Dataset irds.aol-ia

This is a version of the AOL Query Log. Documents use versions that appeared around the time of the query log (early 2006) via the Internet Archive.

The query log does not include document or query IDs. These are instead created by ir_datasets. Document IDs are assigned using a hash of the URL that appears in the query log. Query IDs are assigned using a hash of the normalized query. All unique normalized queries are available from queries, and all clicked documents are available from qrels (iteration value set to the user ID). Full information (including the original query) is available from qlogs.

AQUAINT

A collection of about 1M English newswire documents. Sources are the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service.

Dataset details

Dataset irds.aquaint.documents

A collection of about 1M English newswire documents. Sources are the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service.

Dataset details

Dataset irds.aquaint.trec-robust-2005.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Robust 2005 dataset. Contains a subset of 50 "hard" queries from trec-robust04.

Documents: News articles
Queries: keyword queries, descriptions, narratives
Relevance: Deep judgments
Shared task site
Task overview paper
See also: trec-robust04

Dataset irds.aquaint.trec-robust-2005.qrels

The TREC Robust 2005 dataset. Contains a subset of 50 "hard" queries from trec-robust04.

Documents: News articles
Queries: keyword queries, descriptions, narratives
Relevance: Deep judgments
Shared task site
Task overview paper
See also: trec-robust04

Dataset irds.aquaint.trec-robust-2005

The TREC Robust 2005 dataset. Contains a subset of 50 "hard" queries from trec-robust04.

Documents: News articles
Queries: keyword queries, descriptions, narratives
Relevance: Deep judgments
Shared task site
Task overview paper
See also: trec-robust04

args.me version 1.0

Corpus version 1.0 with 387 606 arguments crawled from Debatewise, IDebate.org, Debatepedia, Debate.org. It was released on July 9, 2019 on Zenodo. The cleaned version argsme/1.0-cleaned should be preferred.

This collection is licensed with the Creative Commons Attribution 4.0 International. Individual rights to the content still apply.

Dataset irds.argsme.1.0.documents

Corpus version 1.0 with 387 606 arguments crawled from Debatewise, IDebate.org, Debatepedia, Debate.org. It was released on July 9, 2019 on Zenodo. The cleaned version argsme/1.0-cleaned should be preferred.

This collection is licensed with the Creative Commons Attribution 4.0 International. Individual rights to the content still apply.

Dataset irds.argsme.1.0.touche-2020-task-1.uncorrected.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of argsme/2020-04-01/touche-2020-task-1 that uses the argsme/1.0 corpus with uncorrected relevance judgements derived from crowdworkers. This dataset's relevance judgements should not be used without preprocessing.

Dataset irds.argsme.1.0.touche-2020-task-1.uncorrected.qrels

Version of argsme/2020-04-01/touche-2020-task-1 that uses the argsme/1.0 corpus with uncorrected relevance judgements derived from crowdworkers. This dataset's relevance judgements should not be used without preprocessing.

Dataset irds.argsme.1.0.touche-2020-task-1.uncorrected

Version of argsme/2020-04-01/touche-2020-task-1 that uses the argsme/1.0 corpus with uncorrected relevance judgements derived from crowdworkers. This dataset's relevance judgements should not be used without preprocessing.

args.me version 1.0 cleaned

Corpus version 1.0-cleaned with 382 545 arguments crawled from Debatewise, IDebate.org, Debatepedia, Debate.org. This version contains the same arguments as argsme/1.0, but was cleaned as described in the corresponding publication. It was released on October 27, 2020 on Zenodo.

This collection is licensed with the Creative Commons Attribution 4.0 International. Individual rights to the content still apply.

Dataset irds.argsme.1.0-cleaned.documents

Corpus version 1.0-cleaned with 382 545 arguments crawled from Debatewise, IDebate.org, Debatepedia, Debate.org. This version contains the same arguments as argsme/1.0, but was cleaned as described in the corresponding publication. It was released on October 27, 2020 on Zenodo.

This collection is licensed with the Creative Commons Attribution 4.0 International. Individual rights to the content still apply.

argsme/2020-04-01/debateorg

Subset of the 338 620 arguments from argsme/2020-04-01 that were crawled from the debate portal Debate.org.

Dataset irds.argsme.2020-04-01.debateorg.documents

Subset of the 338 620 arguments from argsme/2020-04-01 that were crawled from the debate portal Debate.org.

argsme/2020-04-01/debatepedia

Subset of the 21 197 arguments from argsme/2020-04-01 that were crawled from the debate portal Debatepedia.

Dataset irds.argsme.2020-04-01.debatepedia.documents

Subset of the 21 197 arguments from argsme/2020-04-01 that were crawled from the debate portal Debatepedia.

argsme/2020-04-01/debatewise

Subset of the 14 353 arguments from argsme/2020-04-01 that were crawled from the debate portal Debatewise.

Dataset irds.argsme.2020-04-01.debatewise.documents

Subset of the 14 353 arguments from argsme/2020-04-01 that were crawled from the debate portal Debatewise.

argsme/2020-04-01/idebate

Subset of the 13 522 arguments from argsme/2020-04-01 that were crawled from the debate portal IDebate.org.

Dataset irds.argsme.2020-04-01.idebate.documents

Subset of the 13 522 arguments from argsme/2020-04-01 that were crawled from the debate portal IDebate.org.

argsme/2020-04-01/parliamentary

Subset of the 48 arguments from argsme/2020-04-01 that were crawled from Canadian Parliament discussions.

Dataset irds.argsme.2020-04-01.parliamentary.documents

Subset of the 48 arguments from argsme/2020-04-01 that were crawled from Canadian Parliament discussions.

argsme/2020-04-01/processed

Pre-processed version of argsme/2020-04-01 where each argument is split into sentences.

Dataset irds.argsme.2020-04-01.processed.documents

Pre-processed version of argsme/2020-04-01 where each argument is split into sentences.

Dataset irds.argsme.2020-04-01.processed.touche-2022-task-1.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Decision making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2022 is the third lab on argument retrieval at CLEF 2022 featuring three tasks.

Given a query about a controversial topic, retrieve and rank a relevant pair of sentences from a collection of arguments (argsme/2020-04-01/processed).

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

Dataset irds.argsme.2020-04-01.processed.touche-2022-task-1.qrels

Decision making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2022 is the third lab on argument retrieval at CLEF 2022 featuring three tasks.

Given a query about a controversial topic, retrieve and rank a relevant pair of sentences from a collection of arguments (argsme/2020-04-01/processed).

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

Dataset irds.argsme.2020-04-01.processed.touche-2022-task-1

Decision making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2022 is the third lab on argument retrieval at CLEF 2022 featuring three tasks.

Given a query about a controversial topic, retrieve and rank a relevant pair of sentences from a collection of arguments (argsme/2020-04-01/processed).

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

args.me

Corpus version 2020-04-01 with 387 740 arguments crawled from Debatewise, IDebate.org, Debatepedia, Debate.org, and from Canadian Parliament discussions. It was released on April 1, 2020 on Zenodo.

This collection is licensed with the Creative Commons Attribution 4.0 International. Individual rights to the content still apply.
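As a quick consistency check, the sizes of the five source-specific subsets listed on this page (debateorg, debatepedia, debatewise, idebate, parliamentary) can be added up and compared with the corpus size stated for version 2020-04-01. The dictionary below only restates numbers from this page; the variable names are illustrative.

```python
# Subset sizes as given in the argsme/2020-04-01/* entries on this page.
subset_sizes = {
    "debateorg": 338_620,
    "debatepedia": 21_197,
    "debatewise": 14_353,
    "idebate": 13_522,
    "parliamentary": 48,
}

total = sum(subset_sizes.values())
print(total)  # 387740
print(total == 387_740)  # True: matches the stated corpus size
```

The match suggests that the five subsets partition the corpus by source, although this page does not state that explicitly.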

Dataset irds.argsme.2020-04-01.documents

Corpus version 2020-04-01 with 387 740 arguments crawled from Debatewise, IDebate.org, Debatepedia, Debate.org, and from Canadian Parliament discussions. It was released on April 1, 2020 on Zenodo.

This collection is licensed with the Creative Commons Attribution 4.0 International. Individual rights to the content still apply.

Dataset irds.argsme.2020-04-01.touche-2020-task-1.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Decision making processes, be it at the societal or at the personal level, eventually come to a point where one side will challenge the other with a why-question, which is a prompt to justify one's stance. Thus, technologies for argument mining and argumentation processing are maturing at a rapid pace, giving rise for the first time to argument retrieval. Touché 2020 is the first lab on Argument Retrieval at CLEF 2020 featuring two tasks.

Given a question on a controversial topic, retrieve relevant arguments from a focused crawl of online debate portals (argsme/2020-04-01).

Documents are judged based on their general topical relevance.

Dataset irds.argsme.2020-04-01.touche-2020-task-1.qrels

Decision making processes, be it at the societal or at the personal level, eventually come to a point where one side will challenge the other with a why-question, which is a prompt to justify one's stance. Thus, technologies for argument mining and argumentation processing are maturing at a rapid pace, giving rise for the first time to argument retrieval. Touché 2020 is the first lab on Argument Retrieval at CLEF 2020 featuring two tasks.

Given a question on a controversial topic, retrieve relevant arguments from a focused crawl of online debate portals (argsme/2020-04-01).

Documents are judged based on their general topical relevance.

Dataset irds.argsme.2020-04-01.touche-2020-task-1

Decision making processes, be it at the societal or at the personal level, eventually come to a point where one side will challenge the other with a why-question, which is a prompt to justify one's stance. Thus, technologies for argument mining and argumentation processing are maturing at a rapid pace, giving rise for the first time to argument retrieval. Touché 2020 is the first lab on Argument Retrieval at CLEF 2020 featuring two tasks.

Given a question on a controversial topic, retrieve relevant arguments from a focused crawl of online debate portals (argsme/2020-04-01).

Documents are judged based on their general topical relevance.

Dataset irds.argsme.2020-04-01.touche-2021-task-1.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Decision making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2021 is the second lab on argument retrieval at CLEF 2021 featuring two tasks.

Given a question on a controversial topic, retrieve relevant arguments from a focused crawl of online debate portals (argsme/2020-04-01).

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

Dataset irds.argsme.2020-04-01.touche-2021-task-1.qrels

Decision making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2021 is the second lab on argument retrieval at CLEF 2021 featuring two tasks.

Given a question on a controversial topic, retrieve relevant arguments from a focused crawl of online debate portals (argsme/2020-04-01).

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

Dataset irds.argsme.2020-04-01.touche-2021-task-1

Decision making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2021 is the second lab on argument retrieval at CLEF 2021 featuring two tasks.

Given a question on a controversial topic, retrieve relevant arguments from a focused crawl of online debate portals (argsme/2020-04-01).

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

Dataset irds.argsme.2020-04-01.touche-2020-task-1.uncorrected.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of argsme/2020-04-01/touche-2020-task-1 that uses uncorrected relevance judgements derived from crowdworkers. This dataset's relevance judgements should not be used without preprocessing.

Dataset irds.argsme.2020-04-01.touche-2020-task-1.uncorrected.qrels

Version of argsme/2020-04-01/touche-2020-task-1 that uses uncorrected relevance judgements derived from crowdworkers. This dataset's relevance judgements should not be used without preprocessing.

Dataset irds.argsme.2020-04-01.touche-2020-task-1.uncorrected

Version of argsme/2020-04-01/touche-2020-task-1 that uses uncorrected relevance judgements derived from crowdworkers. This dataset's relevance judgements should not be used without preprocessing.

beir/arguana

A version of the ArguAna Counterargs dataset, for argument retrieval.

Dataset irds.beir.arguana.documents

A version of the ArguAna Counterargs dataset, for argument retrieval.

Dataset irds.beir.arguana.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the ArguAna Counterargs dataset, for argument retrieval.

Dataset irds.beir.arguana.qrels

A version of the ArguAna Counterargs dataset, for argument retrieval.

Dataset irds.beir.arguana

A version of the ArguAna Counterargs dataset, for argument retrieval.

beir/climate-fever

A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.

Dataset irds.beir.climate-fever.documents

A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.

Dataset irds.beir.climate-fever.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.

Dataset irds.beir.climate-fever.qrels

A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.

Dataset irds.beir.climate-fever

A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.

beir/cqadupstack/android

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the android StackExchange subforum.

Dataset irds.beir.cqadupstack.android.documents

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the android StackExchange subforum.

Dataset irds.beir.cqadupstack.android.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the android StackExchange subforum.

Dataset irds.beir.cqadupstack.android.qrels

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the android StackExchange subforum.

Dataset irds.beir.cqadupstack.android

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the android StackExchange subforum.

beir/cqadupstack/english

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the english StackExchange subforum.

Dataset irds.beir.cqadupstack.english.documents

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the english StackExchange subforum.

Dataset irds.beir.cqadupstack.english.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the english StackExchange subforum.

Dataset irds.beir.cqadupstack.english.qrels

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the english StackExchange subforum.

Dataset irds.beir.cqadupstack.english

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the english StackExchange subforum.

beir/cqadupstack/gaming

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gaming StackExchange subforum.

Dataset irds.beir.cqadupstack.gaming.documents

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gaming StackExchange subforum.

Dataset irds.beir.cqadupstack.gaming.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gaming StackExchange subforum.

Dataset irds.beir.cqadupstack.gaming.qrels

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gaming StackExchange subforum.

Dataset irds.beir.cqadupstack.gaming

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gaming StackExchange subforum.

beir/cqadupstack/gis

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gis StackExchange subforum.

Dataset irds.beir.cqadupstack.gis.documents

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gis StackExchange subforum.

Dataset irds.beir.cqadupstack.gis.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gis StackExchange subforum.

Dataset irds.beir.cqadupstack.gis.qrels

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gis StackExchange subforum.

Dataset irds.beir.cqadupstack.gis

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gis StackExchange subforum.

beir/cqadupstack/mathematica

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the mathematica StackExchange subforum.

Dataset irds.beir.cqadupstack.mathematica.documents

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the mathematica StackExchange subforum.

Dataset irds.beir.cqadupstack.mathematica.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the mathematica StackExchange subforum.

Dataset irds.beir.cqadupstack.mathematica.qrels

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the mathematica StackExchange subforum.

Dataset irds.beir.cqadupstack.mathematica

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the mathematica StackExchange subforum.

beir/cqadupstack/physics

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the physics StackExchange subforum.

Dataset irds.beir.cqadupstack.physics.documents

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the physics StackExchange subforum.

Dataset irds.beir.cqadupstack.physics.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the physics StackExchange subforum.

Dataset irds.beir.cqadupstack.physics.qrels

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the physics StackExchange subforum.

Dataset irds.beir.cqadupstack.physics

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the physics StackExchange subforum.

beir/cqadupstack/programmers

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the programmers StackExchange subforum.

Dataset irds.beir.cqadupstack.programmers.documents

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the programmers StackExchange subforum.

Dataset irds.beir.cqadupstack.programmers.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the programmers StackExchange subforum.

Dataset irds.beir.cqadupstack.programmers.qrels

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the programmers StackExchange subforum.

Dataset irds.beir.cqadupstack.programmers

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the programmers StackExchange subforum.

beir/cqadupstack/stats

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the stats StackExchange subforum.

Dataset irds.beir.cqadupstack.stats.documents

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the stats StackExchange subforum.

Dataset irds.beir.cqadupstack.stats.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the stats StackExchange subforum.

Dataset irds.beir.cqadupstack.stats.qrels

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the stats StackExchange subforum.

Dataset irds.beir.cqadupstack.stats

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the stats StackExchange subforum.

beir/cqadupstack/tex

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the tex StackExchange subforum.

Dataset irds.beir.cqadupstack.tex.documents

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the tex StackExchange subforum.

Dataset irds.beir.cqadupstack.tex.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the tex StackExchange subforum.

Dataset irds.beir.cqadupstack.tex.qrels

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the tex StackExchange subforum.

Dataset irds.beir.cqadupstack.tex

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the tex StackExchange subforum.

beir/cqadupstack/unix

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the unix StackExchange subforum.

Dataset irds.beir.cqadupstack.unix.documents

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the unix StackExchange subforum.

Dataset irds.beir.cqadupstack.unix.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the unix StackExchange subforum.

Dataset irds.beir.cqadupstack.unix.qrels

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the unix StackExchange subforum.

Dataset irds.beir.cqadupstack.unix

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the unix StackExchange subforum.

beir/cqadupstack/webmasters

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the webmasters StackExchange subforum.

Dataset irds.beir.cqadupstack.webmasters.documents

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the webmasters StackExchange subforum.

Dataset irds.beir.cqadupstack.webmasters.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the webmasters StackExchange subforum.

Dataset irds.beir.cqadupstack.webmasters.qrels

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the webmasters StackExchange subforum.

Dataset irds.beir.cqadupstack.webmasters

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the webmasters StackExchange subforum.

beir/cqadupstack/wordpress

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the wordpress StackExchange subforum.

Dataset irds.beir.cqadupstack.wordpress.documents

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the wordpress StackExchange subforum.

Dataset irds.beir.cqadupstack.wordpress.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the wordpress StackExchange subforum.

Dataset irds.beir.cqadupstack.wordpress.qrels

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the wordpress StackExchange subforum.

Dataset irds.beir.cqadupstack.wordpress

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the wordpress StackExchange subforum.
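
The identifiers in this listing appear to follow one convention: the ir_datasets ID (for example beir/cqadupstack/wordpress) with slashes turned into dots and an irds. prefix, optionally followed by .documents, .queries or .qrels. The helper below is a minimal sketch of that mapping, inferred only from the names on this page; `to_datamaestro_id` is a hypothetical name and is not part of datamaestro_text.

```python
# Hypothetical helper: maps an ir_datasets ID to the naming pattern used in
# this listing. Inferred from the page itself, not taken from the library.
VALID_PARTS = ("documents", "queries", "qrels")


def to_datamaestro_id(irds_id, part=None):
    """Return e.g. 'irds.beir.fever.dev.qrels' for ('beir/fever/dev', 'qrels')."""
    if part is not None and part not in VALID_PARTS:
        raise ValueError(f"unknown part: {part!r}")
    # Dots that are already in the ID (as in 'car/v1.5') are kept as they are.
    base = "irds." + irds_id.strip("/").replace("/", ".")
    return base if part is None else f"{base}.{part}"


print(to_datamaestro_id("beir/cqadupstack/wordpress", "queries"))
```

The resulting string is the kind of identifier one would hand to datamaestro's dataset loader (for instance `datamaestro.prepare_dataset`, assuming the installed version exposes it). The reverse mapping is not attempted, because IDs such as car/v1.5 contain dots of their own.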

beir/dbpedia-entity

A version of the DBPedia-Entity-v2 dataset for entity retrieval.

Dataset irds.beir.dbpedia-entity.documents

A version of the DBPedia-Entity-v2 dataset for entity retrieval.

Dataset irds.beir.dbpedia-entity.queries

A version of the DBPedia-Entity-v2 dataset for entity retrieval.

Dataset irds.beir.dbpedia-entity.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A random sample of 67 queries from the official test set, used as a dev set.

Dataset irds.beir.dbpedia-entity.dev.qrels

A random sample of 67 queries from the official test set, used as a dev set.

Dataset irds.beir.dbpedia-entity.dev

A random sample of 67 queries from the official test set, used as a dev set.

Dataset irds.beir.dbpedia-entity.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The official test set, without the 67 queries used as a dev set.

Dataset irds.beir.dbpedia-entity.test.qrels

The official test set, without the 67 queries used as a dev set.

Dataset irds.beir.dbpedia-entity.test

The official test set, without the 67 queries used as a dev set.

beir/fever

A version of the FEVER dataset for fact verification. Includes queries from the /train, /dev, and /test subsets.

Dataset irds.beir.fever.documents

A version of the FEVER dataset for fact verification. Includes queries from the /train, /dev, and /test subsets.

Dataset irds.beir.fever.queries

A version of the FEVER dataset for fact verification. Includes queries from the /train, /dev, and /test subsets.

Dataset irds.beir.fever.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The official dev set.

Dataset irds.beir.fever.dev.qrels

The official dev set.

Dataset irds.beir.fever.dev

The official dev set.

Dataset irds.beir.fever.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The official test set.

Dataset irds.beir.fever.test.qrels

The official test set.

Dataset irds.beir.fever.test

The official test set.

Dataset irds.beir.fever.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The official train set.

Dataset irds.beir.fever.train.qrels

The official train set.

Dataset irds.beir.fever.train

The official train set.

beir/fiqa

A version of the FIQA-2018 dataset (financial opinion question answering). Queries include those in the /train, /dev, and /test subsets.

Dataset irds.beir.fiqa.documents

A version of the FIQA-2018 dataset (financial opinion question answering). Queries include those in the /train, /dev, and /test subsets.

Dataset irds.beir.fiqa.queries

A version of the FIQA-2018 dataset (financial opinion question answering). Queries include those in the /train, /dev, and /test subsets.

Dataset irds.beir.fiqa.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Random sample of 500 queries from the official dataset.

Dataset irds.beir.fiqa.dev.qrels

Random sample of 500 queries from the official dataset.

Dataset irds.beir.fiqa.dev

Random sample of 500 queries from the official dataset.

Dataset irds.beir.fiqa.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Random sample of 648 queries from the official dataset.

Dataset irds.beir.fiqa.test.qrels

Random sample of 648 queries from the official dataset.

Dataset irds.beir.fiqa.test

Random sample of 648 queries from the official dataset.

Dataset irds.beir.fiqa.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official dataset without the 1148 queries sampled for /dev and /test.

Dataset irds.beir.fiqa.train.qrels

Official dataset without the 1148 queries sampled for /dev and /test.

Dataset irds.beir.fiqa.train

Official dataset without the 1148 queries sampled for /dev and /test.
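
The three FiQA subsets are described as disjoint: 500 queries for /dev plus 648 for /test gives the 1148 queries removed from /train. As a rough illustration of that kind of partition (not the procedure the BEIR authors actually used), the sketch below holds out random samples from a pool of query IDs. The pool size of 2000, the seed, and the name `hold_out` are placeholders chosen for the example.

```python
import random


def hold_out(query_ids, sizes, seed=0):
    """Partition query_ids into disjoint named samples plus a 'train' remainder.

    sizes maps a split name to the number of queries to sample for it.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(query_ids)
    rng.shuffle(shuffled)
    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = shuffled[start:start + size]
        start += size
    splits["train"] = shuffled[start:]  # everything that was not sampled
    return splits


# Placeholder pool of 2000 query IDs; the real FiQA pool is not stated here.
splits = hold_out(range(2000), {"dev": 500, "test": 648})
removed_from_train = len(splits["dev"]) + len(splits["test"])
print(removed_from_train)  # 500 + 648
```

Shuffling once and slicing guarantees that no query ID lands in two subsets, which is the property the descriptions above rely on.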

beir/hotpotqa

A version of the Hotpot QA dataset for multi-hop question answering. Queries include all those in /train, /dev, and /test.

Dataset irds.beir.hotpotqa.documents

A version of the Hotpot QA dataset for multi-hop question answering. Queries include all those in /train, /dev, and /test.

Dataset irds.beir.hotpotqa.queries

A version of the Hotpot QA dataset for multi-hop question answering. Queries include all those in /train, /dev, and /test.

Dataset irds.beir.hotpotqa.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A random selection of 5447 queries from /train.

Dataset irds.beir.hotpotqa.dev.qrels

A random selection of 5447 queries from /train.

Dataset irds.beir.hotpotqa.dev

A random selection of 5447 queries from /train.

Dataset irds.beir.hotpotqa.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official dev set from HotpotQA, here used as a test set.

Dataset irds.beir.hotpotqa.test.qrels

Official dev set from HotpotQA, here used as a test set.

Dataset irds.beir.hotpotqa.test

Official dev set from HotpotQA, here used as a test set.

Dataset irds.beir.hotpotqa.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official train set, without the random selection of the 5447 queries used for /dev.

Dataset irds.beir.hotpotqa.train.qrels

Official train set, without the random selection of the 5447 queries used for /dev.

Dataset irds.beir.hotpotqa.train

Official train set, without the random selection of the 5447 queries used for /dev.

beir/msmarco

A version of the MS MARCO passage ranking dataset. Includes queries from the /train, /dev, and /test sub-datasets.

Note that this version differs from msmarco-passage, in that it does not correct the encoding problems in the source documents.
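
The note above does not say what the encoding problems look like. A common form in web-derived text is UTF-8 bytes that were decoded as Latin-1, so that "é" shows up as "Ã©". Assuming that is the kind of damage meant here, the sketch below shows one conservative repair; `repair_mojibake` is a hypothetical helper, not the routine used by msmarco-passage.

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up, leaving other text untouched."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1 or the bytes are
        # not valid UTF-8, so it was probably not damaged in this way.
        return text


print(repair_mojibake("cafÃ©"))
```

Because the function falls back to the input when the round trip fails, correctly encoded text such as "café" passes through unchanged.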

Dataset irds.beir.msmarco.documents

A version of the MS MARCO passage ranking dataset. Includes queries from the /train, /dev, and /test sub-datasets.

Note that this version differs from msmarco-passage, in that it does not correct the encoding problems in the source documents.

Dataset irds.beir.msmarco.queries

A version of the MS MARCO passage ranking dataset. Includes queries from the /train, /dev, and /test sub-datasets.

Note that this version differs from msmarco-passage, in that it does not correct the encoding problems in the source documents.

Dataset irds.beir.msmarco.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the MS MARCO passage ranking dev set.

See also: msmarco-passage/dev
Dataset Paper

Dataset irds.beir.msmarco.dev.qrels

A version of the MS MARCO passage ranking dev set.

See also: msmarco-passage/dev
Dataset Paper

Dataset irds.beir.msmarco.dev

A version of the MS MARCO passage ranking dev set.

See also: msmarco-passage/dev
Dataset Paper

Dataset irds.beir.msmarco.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the TREC Deep Learning 2019 set.

See also: msmarco-passage/trec-dl-2019
Shared Task Paper

Dataset irds.beir.msmarco.test.qrels

A version of the TREC Deep Learning 2019 set.

See also: msmarco-passage/trec-dl-2019
Shared Task Paper

Dataset irds.beir.msmarco.test

A version of the TREC Deep Learning 2019 set.

See also: msmarco-passage/trec-dl-2019
Shared Task Paper

Dataset irds.beir.msmarco.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the MS MARCO passage ranking train set.

See also: msmarco-passage/train

Dataset irds.beir.msmarco.train.qrels

A version of the MS MARCO passage ranking train set.

See also: msmarco-passage/train

Dataset irds.beir.msmarco.train

A version of the MS MARCO passage ranking train set.

See also: msmarco-passage/train

beir/nfcorpus

A version of the NF Corpus (Nutrition Facts). Queries use the "title" variant of the query, which here are often natural language questions. Queries include all those from /train, /dev, and /test.

Data pre-processing may be different than what is done in nfcorpus.

Dataset irds.beir.nfcorpus.documents

A version of the NF Corpus (Nutrition Facts). Queries use the "title" variant of the query, which here are often natural language questions. Queries include all those from /train, /dev, and /test.

Data pre-processing may be different than what is done in nfcorpus.

Dataset irds.beir.nfcorpus.queries

A version of the NF Corpus (Nutrition Facts). Queries use the "title" variant of the query, which here are often natural language questions. Queries include all those from /train, /dev, and /test.

Data pre-processing may be different than what is done in nfcorpus.

Dataset irds.beir.nfcorpus.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Combined dev set of NFCorpus.

See also: nfcorpus/dev

Dataset irds.beir.nfcorpus.dev.qrels

Combined dev set of NFCorpus.

See also: nfcorpus/dev

Dataset irds.beir.nfcorpus.dev

Combined dev set of NFCorpus.

See also: nfcorpus/dev

Dataset irds.beir.nfcorpus.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Combined test set of NFCorpus.

See also: nfcorpus/test

Dataset irds.beir.nfcorpus.test.qrels

Combined test set of NFCorpus.

See also: nfcorpus/test

Dataset irds.beir.nfcorpus.test

Combined test set of NFCorpus.

See also: nfcorpus/test

Dataset irds.beir.nfcorpus.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Combined train set of NFCorpus.

See also: nfcorpus/train

Dataset irds.beir.nfcorpus.train.qrels

Combined train set of NFCorpus.

See also: nfcorpus/train

Dataset irds.beir.nfcorpus.train

Combined train set of NFCorpus.

See also: nfcorpus/train

beir/nq

A version of the Natural Questions dev dataset.

Data pre-processing differs both from what is done in natural-questions and dpr-w100/natural-questions, especially with respect to the document collection and filtering conducted on the queries. See the Beir paper for details.

Dataset website
Dataset paper
See also: natural-questions, dpr-w100/natural-questions

Dataset irds.beir.nq.documents

A version of the Natural Questions dev dataset.

Data pre-processing differs both from what is done in natural-questions and dpr-w100/natural-questions, especially with respect to the document collection and filtering conducted on the queries. See the Beir paper for details.

Dataset website
Dataset paper
See also: natural-questions, dpr-w100/natural-questions

Dataset irds.beir.nq.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the Natural Questions dev dataset.

Data pre-processing differs both from what is done in natural-questions and dpr-w100/natural-questions, especially with respect to the document collection and filtering conducted on the queries. See the Beir paper for details.

Dataset website
Dataset paper
See also: natural-questions, dpr-w100/natural-questions

Dataset irds.beir.nq.qrels

A version of the Natural Questions dev dataset.

Data pre-processing differs both from what is done in natural-questions and dpr-w100/natural-questions, especially with respect to the document collection and filtering conducted on the queries. See the Beir paper for details.

Dataset website
Dataset paper
See also: natural-questions, dpr-w100/natural-questions

Dataset irds.beir.nq

A version of the Natural Questions dev dataset.

Data pre-processing differs both from what is done in natural-questions and dpr-w100/natural-questions, especially with respect to the document collection and filtering conducted on the queries. See the Beir paper for details.

Dataset website
Dataset paper
See also: natural-questions, dpr-w100/natural-questions

beir/quora

A version of the Quora duplicate question detection dataset (QQP). Includes queries from /dev and /test sets.

Dataset website

Dataset irds.beir.quora.documents

A version of the Quora duplicate question detection dataset (QQP). Includes queries from /dev and /test sets.

Dataset website

Dataset irds.beir.quora.queries

A version of the Quora duplicate question detection dataset (QQP). Includes queries from /dev and /test sets.

Dataset website

Dataset irds.beir.quora.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A 5,000 question subset of the original dataset, without overlaps in the other subsets.

Dataset irds.beir.quora.dev.qrels

A 5,000 question subset of the original dataset, without overlaps in the other subsets.

Dataset irds.beir.quora.dev

A 5,000 question subset of the original dataset, without overlaps in the other subsets.

Dataset irds.beir.quora.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A 10,000 question subset of the original dataset, without overlaps in the other subsets.

Dataset irds.beir.quora.test.qrels

A 10,000 question subset of the original dataset, without overlaps in the other subsets.

Dataset irds.beir.quora.test

A 10,000 question subset of the original dataset, without overlaps in the other subsets.

beir/scidocs

A version of the SciDocs dataset, used for citation retrieval.

Dataset irds.beir.scidocs.documents

A version of the SciDocs dataset, used for citation retrieval.

Dataset irds.beir.scidocs.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the SciDocs dataset, used for citation retrieval.

Dataset irds.beir.scidocs.qrels

A version of the SciDocs dataset, used for citation retrieval.

Dataset irds.beir.scidocs

A version of the SciDocs dataset, used for citation retrieval.

beir/scifact

A version of the SciFact dataset, for fact verification. Queries include those from the /train and /test sets.

Dataset irds.beir.scifact.documents

A version of the SciFact dataset, for fact verification. Queries include those from the /train and /test sets.

Dataset irds.beir.scifact.queries

A version of the SciFact dataset, for fact verification. Queries include those from the /train and /test sets.

Dataset irds.beir.scifact.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The official dev set.

Dataset irds.beir.scifact.test.qrels

The official dev set.

Dataset irds.beir.scifact.test

The official dev set.

Dataset irds.beir.scifact.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The official train set.

Dataset irds.beir.scifact.train.qrels

The official train set.

Dataset irds.beir.scifact.train

The official train set.

beir/trec-covid

A version of the TREC COVID (complete) dataset, with titles and abstracts as documents. Queries are the question variant.

Data pre-processing may be different than what is done in cord19/trec-covid.

Dataset irds.beir.trec-covid.documents

A version of the TREC COVID (complete) dataset, with titles and abstracts as documents. Queries are the question variant.

Data pre-processing may be different than what is done in cord19/trec-covid.

Dataset irds.beir.trec-covid.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the TREC COVID (complete) dataset, with titles and abstracts as documents. Queries are the question variant.

Data pre-processing may be different than what is done in cord19/trec-covid.

Dataset irds.beir.trec-covid.qrels

A version of the TREC COVID (complete) dataset, with titles and abstracts as documents. Queries are the question variant.

Data pre-processing may be different than what is done in cord19/trec-covid.

Dataset irds.beir.trec-covid

A version of the TREC COVID (complete) dataset, with titles and abstracts as documents. Queries are the question variant.

Data pre-processing may be different than what is done in cord19/trec-covid.

beir/webis-touche2020

Original version of the Touché-2020 dataset, for argument retrieval.

Consider using beir/webis-touche2020/v2 instead; it uses an updated, more complete version of the qrels.

Dataset irds.beir.webis-touche2020.documents

Original version of the Touché-2020 dataset, for argument retrieval.

Consider using beir/webis-touche2020/v2 instead; it uses an updated, more complete version of the qrels.

Dataset irds.beir.webis-touche2020.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Original version of the Touché-2020 dataset, for argument retrieval.

Consider using beir/webis-touche2020/v2 instead; it uses an updated, more complete version of the qrels.

Dataset irds.beir.webis-touche2020.qrels

Original version of the Touché-2020 dataset, for argument retrieval.

Consider using beir/webis-touche2020/v2 instead; it uses an updated, more complete version of the qrels.

Dataset irds.beir.webis-touche2020

Original version of the Touché-2020 dataset, for argument retrieval.

Consider using beir/webis-touche2020/v2 instead; it uses an updated, more complete version of the qrels.

beir/webis-touche2020/v2

Version 2 of the Touché-2020 dataset, for argument retrieval. This version uses the "corrected" version of the qrels, mapped to version 1 of the corpus.

Dataset irds.beir.webis-touche2020.v2.documents

Version 2 of the Touché-2020 dataset, for argument retrieval. This version uses the "corrected" version of the qrels, mapped to version 1 of the corpus.

Dataset irds.beir.webis-touche2020.v2.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version 2 of the Touché-2020 dataset, for argument retrieval. This version uses the "corrected" version of the qrels, mapped to version 1 of the corpus.

Dataset irds.beir.webis-touche2020.v2.qrels

Version 2 of the Touché-2020 dataset, for argument retrieval. This version uses the "corrected" version of the qrels, mapped to version 1 of the corpus.

Dataset irds.beir.webis-touche2020.v2

Version 2 of the Touché-2020 dataset, for argument retrieval. This version uses the "corrected" version of the qrels, mapped to version 1 of the corpus.

c4/en-noclean-tr

The "en-noclean" train subset of the corpus, consisting of ~1B documents written in English. Document IDs are assigned as proposed by the TREC Health Misinformation 2021 track.

Dataset irds.c4.en-noclean-tr.documents

The "en-noclean" train subset of the corpus, consisting of ~1B documents written in English. Document IDs are assigned as proposed by the TREC Health Misinformation 2021 track.

Dataset irds.c4.en-noclean-tr.trec-misinfo-2021.queries

The TREC Health Misinformation 2021 track.

Shared Task Website

car/v1.5

Version 1.5 of the TREC CAR dataset. This version is used for year 1 (2017) of the TREC CAR shared task.

Dataset irds.car.v1.5.documents

Version 1.5 of the TREC CAR dataset. This version is used for year 1 (2017) of the TREC CAR shared task.

Dataset irds.car.v1.5.test200.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Unofficial test set consisting of manually selected articles. Sometimes used as a validation set.

Dataset irds.car.v1.5.test200.qrels

Unofficial test set consisting of manually selected articles. Sometimes used as a validation set.

Dataset irds.car.v1.5.test200

Unofficial test set consisting of manually selected articles. Sometimes used as a validation set.

Dataset irds.car.v1.5.train.fold0.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 0 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold0.qrels

Fold 0 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold0

Fold 0 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold1.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 1 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold1.qrels

Fold 1 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold1

Fold 1 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold2.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 2 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold2.qrels

Fold 2 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold2

Fold 2 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold3.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 3 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold3.qrels

Fold 3 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold3

Fold 3 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold4.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 4 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold4.qrels

Fold 4 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.train.fold4

Fold 4 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.trec-y1.queries

Official test set of TREC CAR 2017 (year 1).

Dataset irds.car.v1.5.trec-y1.auto.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official test set of TREC CAR 2017 (year 1), using automatic relevance judgments (assumed from the hierarchical structure of pages, i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.trec-y1.auto.qrels

Official test set of TREC CAR 2017 (year 1), using automatic relevance judgments (assumed from the hierarchical structure of pages, i.e., paragraphs under a header are assumed relevant).

Dataset irds.car.v1.5.trec-y1.auto

Official test set of TREC CAR 2017 (year 1), using automatic relevance judgments (assumed from the hierarchical structure of pages, i.e., paragraphs under a header are assumed relevant).
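
The automatic judgments described above can be pictured with a toy example. The sketch below assumes a page is given as a title plus a mapping from each heading to the IDs of the paragraphs directly beneath it, and that a query is formed from the title and the heading. Both the data layout and the name `auto_qrels` are illustrative assumptions, not the TREC CAR file format or the track's tooling.

```python
def auto_qrels(page_title, sections):
    """Derive (query, paragraph_id, relevance) triples from a page outline.

    sections maps a heading to the IDs of the paragraphs directly under it.
    Every such paragraph is assumed relevant (relevance 1) to the query built
    from the page title and that heading.
    """
    qrels = []
    for heading, paragraph_ids in sections.items():
        query = f"{page_title} {heading}"
        for paragraph_id in paragraph_ids:
            qrels.append((query, paragraph_id, 1))
    return qrels


# Made-up page outline, used only to show the shape of the output.
page = {"History": ["p1", "p2"], "Uses": ["p3"]}
for row in auto_qrels("Chocolate", page):
    print(row)
```

Whether paragraphs in nested subsections also count for the parent heading is a design choice this sketch leaves out; it only handles paragraphs directly under a heading.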

Dataset irds.car.v1.5.trec-y1.manual.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official test set of TREC CAR 2017 (year 1), using manual graded relevance judgments.

Dataset irds.car.v1.5.trec-y1.manual.qrels

Official test set of TREC CAR 2017 (year 1), using manual graded relevance judgments.

Dataset irds.car.v1.5.trec-y1.manual

Official test set of TREC CAR 2017 (year 1), using manual graded relevance judgments.

car/v2.0

Version 2.0 of the TREC CAR dataset.

Dataset irds.car.v2.0.documents

Version 2.0 of the TREC CAR dataset.

Highwire (TREC Genomics 2006-07)

Medical document collection from Highwire Press. Includes 162,259 scientific articles from 49 journals.

This dataset is used for the TREC 2006-07 Genomics track.

Note that these documents are split into passages based on paragraph tags in the HTML.

Documents: Biomedical journal articles
Information about document collection
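
To make the passage splitting concrete, the following is a minimal sketch that cuts an HTML string into passages at paragraph tags using Python's standard html.parser module. It illustrates the idea only; the class and function names are hypothetical, and the actual preprocessing applied to the Highwire collection may differ in details such as which tags are honoured or how whitespace is handled.

```python
from html.parser import HTMLParser


class ParagraphSplitter(HTMLParser):
    """Collects the text found inside each <p> element as one passage."""

    def __init__(self):
        super().__init__()
        self.passages = []    # finished passages, in document order
        self._buffer = None   # text pieces of the open <p>, None outside one

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()     # an unclosed <p> ends where the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.passages.append(text)
        self._buffer = None

    def close(self):
        super().close()
        self._flush()


def split_passages(html):
    """Return the list of paragraph-level passages found in an HTML string."""
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    return parser.passages


print(split_passages("<h1>Title</h1><p>First  one.</p><p>Second <b>bold</b> one.</p>"))
```

Text outside any paragraph element (the heading in the example) is dropped by this sketch, which may or may not match how the real collection treats such content.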

Dataset irds.highwire.documents

Medical document collection from Highwire Press. Includes 162,259 scientific articles from 49 journals.

This dataset is used for the TREC 2006-07 Genomics track.

Note that these documents are split into passages based on paragraph tags in the HTML.

Documents: Biomedical journal articles
Information about document collection

Dataset irds.highwire.trec-genomics-2006.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Genomics Track 2006 benchmark. Contains 28 queries with passage-level relevance judgments.

Documents: Biomedical journal articles
Queries: Natural language questions
Qrels: deep, by passage
Shared task data site
Shared task paper

Dataset irds.highwire.trec-genomics-2006.qrels

The TREC Genomics Track 2006 benchmark. Contains 28 queries with passage-level relevance judgments.

Documents: Biomedical journal articles
Queries: Natural language questions
Qrels: deep, by passage
Shared task data site
Shared task paper

Dataset irds.highwire.trec-genomics-2006

The TREC Genomics Track 2006 benchmark. Contains 28 queries with passage-level relevance judgments.

Documents: Biomedical journal articles
Queries: Natural language questions
Qrels: deep, by passage
Shared task data site
Shared task paper

Dataset irds.highwire.trec-genomics-2007.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Genomics Track 2007 benchmark. Contains 36 queries with passage-level relevance judgments.

Documents: Biomedical journal articles
Queries: Natural language questions
Qrels: deep, by passage
Shared task data site
Shared task paper

Dataset irds.highwire.trec-genomics-2007.qrels

The TREC Genomics Track 2007 benchmark. Contains 36 queries with passage-level relevance judgments.

Documents: Biomedical journal articles
Queries: Natural language questions
Qrels: deep, by passage
Shared task data site
Shared task paper

Dataset irds.highwire.trec-genomics-2007

The TREC Genomics Track 2007 benchmark. Contains 36 queries with passage-level relevance judgments.

Documents: Biomedical journal articles
Queries: Natural language questions
Qrels: deep, by passage
Shared task data site
Shared task paper

medline/2004

3M Medline articles including titles and abstracts, used for the TREC 2004-05 Genomics track.

Documents: Biomedical article titles and abstracts
Information about document collection

Dataset irds.medline.2004.documents

3M Medline articles including titles and abstracts, used for the TREC 2004-05 Genomics track.

Documents: Biomedical article titles and abstracts
Information about document collection

Dataset irds.medline.2004.trec-genomics-2004.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Genomics Track 2004 benchmark. Contains 50 queries with article-level relevance judgments.

Documents: Biomedical article titles and abstracts
Queries: Natural language questions
Qrels: deep, graded
Shared task data site
Shared task paper

Dataset irds.medline.2004.trec-genomics-2004.qrels

The TREC Genomics Track 2004 benchmark. Contains 50 queries with article-level relevance judgments.

Documents: Biomedical article titles and abstracts
Queries: Natural language questions
Qrels: deep, graded
Shared task data site
Shared task paper

Dataset irds.medline.2004.trec-genomics-2004

The TREC Genomics Track 2004 benchmark. Contains 50 queries with article-level relevance judgments.

Documents: Biomedical article titles and abstracts
Queries: Natural language questions
Qrels: deep, graded
Shared task data site
Shared task paper

Dataset irds.medline.2004.trec-genomics-2005.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Genomics Track 2005 benchmark. Contains 50 queries with article-level relevance judgments.

Documents: Biomedical article titles and abstracts
Queries: Natural language questions
Qrels: deep, graded
Shared task data site
Shared task paper

Dataset irds.medline.2004.trec-genomics-2005.qrels

The TREC Genomics Track 2005 benchmark. Contains 50 queries with article-level relevance judgments.

Documents: Biomedical article titles and abstracts
Queries: Natural language questions
Qrels: deep, graded
Shared task data site
Shared task paper

Dataset irds.medline.2004.trec-genomics-2005

The TREC Genomics Track 2005 benchmark. Contains 50 queries with article-level relevance judgments.

Documents: Biomedical article titles and abstracts
Queries: Natural language questions
Qrels: deep, graded
Shared task data site
Shared task paper

medline/2017

26M Medline and AACR/ASCO Proceedings articles including titles and abstracts. This collection is used for the TREC 2017-18 Precision Medicine track.

Documents: Biomedical article titles and abstracts
Information about document collection

Dataset irds.medline.2017.documents

26M Medline and AACR/ASCO Proceedings articles including titles and abstracts. This collection is used for the TREC 2017-18 Precision Medicine track.

Documents: Biomedical article titles and abstracts
Information about document collection

Dataset irds.medline.2017.trec-pm-2017.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Precision Medicine (PM) Track 2017 benchmark. Contains 30 queries containing disease, gene, and target demographic information.

Documents: Biomedical article titles and abstracts
Queries: Specific to TREC PM information need
Qrels: deep, graded
Shared task data site
Shared task paper

Dataset irds.medline.2017.trec-pm-2017.qrels

The TREC Precision Medicine (PM) Track 2017 benchmark. Contains 30 queries containing disease, gene, and target demographic information.

Documents: Biomedical article titles and abstracts
Queries: Specific to TREC PM information need
Qrels: deep, graded
Shared task data site
Shared task paper

Dataset irds.medline.2017.trec-pm-2017

The TREC Precision Medicine (PM) Track 2017 benchmark. Contains 30 queries containing disease, gene, and target demographic information.

Documents: Biomedical article titles and abstracts
Queries: Specific to TREC PM information need
Qrels: deep, graded
Shared task data site
Shared task paper

Dataset irds.medline.2017.trec-pm-2018.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Precision Medicine (PM) Track 2018 benchmark. Contains 50 queries containing disease, gene, and target demographic information.

Documents: Biomedical article titles and abstracts
Queries: Specific to TREC PM information need
Qrels: deep, graded
Shared task data site
Shared task paper

Dataset irds.medline.2017.trec-pm-2018.qrels

The TREC Precision Medicine (PM) Track 2018 benchmark. Contains 50 queries, each specifying disease, gene, and target demographic information.

Documents: Biomedical article titles and abstracts
Queries: Specific to TREC PM information need
Qrels: deep, graded
Shared task data site
Shared task paper

Dataset irds.medline.2017.trec-pm-2018

The TREC Precision Medicine (PM) Track 2018 benchmark. Contains 50 queries, each specifying disease, gene, and target demographic information.

Documents: Biomedical article titles and abstracts
Queries: Specific to TREC PM information need
Qrels: deep, graded
Shared task data site
Shared task paper

clinicaltrials/2017

A snapshot of ClinicalTrials.gov from April 2017 for use with the clinicaltrials/2017/trec-pm-2017 and clinicaltrials/2017/trec-pm-2018 Clinical Trials subtasks.

Dataset information
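The descriptions on this page refer to collections by their slash-separated ir_datasets names (for example clinicaltrials/2017/trec-pm-2017), while the entries themselves use dotted names with an irds. prefix. The sketch below shows that naming pattern as inferred from the entries listed here; it is not taken from the datamaestro_text source, and the helper names to_datamaestro_id and component_ids are hypothetical.

```python
from typing import Dict


def to_datamaestro_id(irds_id: str) -> str:
    """Map a slash-separated ir_datasets name to the dotted form used on this page.

    Assumption: the mapping is "replace '/' with '.' and prepend 'irds.'",
    which matches the entries listed here but is not confirmed by the library.
    """
    return "irds." + irds_id.strip("/").replace("/", ".")


def component_ids(irds_id: str) -> Dict[str, str]:
    """Return the combined, queries, and qrels names for a benchmark.

    Not every benchmark on this page lists all three entries, so a returned
    name is a candidate to look up, not a guarantee that the entry exists.
    """
    base = to_datamaestro_id(irds_id)
    return {
        "combined": base,
        "queries": base + ".queries",
        "qrels": base + ".qrels",
    }


if __name__ == "__main__":
    for kind, name in component_ids("clinicaltrials/2017/trec-pm-2017").items():
        print(f"{kind}: {name}")
```

The inverse direction is deliberately left out: a dotted name cannot always be split back unambiguously if a collection name itself contains a dot.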

Dataset irds.clinicaltrials.2017.documents

A snapshot of ClinicalTrials.gov from April 2017 for use with the clinicaltrials/2017/trec-pm-2017 and clinicaltrials/2017/trec-pm-2018 Clinical Trials subtasks.

Dataset information

Dataset irds.clinicaltrials.2017.trec-pm-2017.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC 2017 Precision Medicine clinical trials subtask.

Dataset irds.clinicaltrials.2017.trec-pm-2017.qrels

The TREC 2017 Precision Medicine clinical trials subtask.

Dataset irds.clinicaltrials.2017.trec-pm-2017

The TREC 2017 Precision Medicine clinical trials subtask.

Dataset irds.clinicaltrials.2017.trec-pm-2018.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC 2018 Precision Medicine clinical trials subtask.

Dataset irds.clinicaltrials.2017.trec-pm-2018.qrels

The TREC 2018 Precision Medicine clinical trials subtask.

Dataset irds.clinicaltrials.2017.trec-pm-2018

The TREC 2018 Precision Medicine clinical trials subtask.

clinicaltrials/2019

A snapshot of ClinicalTrials.gov from May 2019 for use with the clinicaltrials/2019/trec-pm-2019 Clinical Trials subtask.

Dataset information

Dataset irds.clinicaltrials.2019.documents

A snapshot of ClinicalTrials.gov from May 2019 for use with the clinicaltrials/2019/trec-pm-2019 Clinical Trials subtask.

Dataset information

Dataset irds.clinicaltrials.2019.trec-pm-2019.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC 2019 Precision Medicine clinical trials subtask.

Dataset irds.clinicaltrials.2019.trec-pm-2019.qrels

The TREC 2019 Precision Medicine clinical trials subtask.

Dataset irds.clinicaltrials.2019.trec-pm-2019

The TREC 2019 Precision Medicine clinical trials subtask.

clinicaltrials/2021

A snapshot of ClinicalTrials.gov from April 2021 for use with the TREC Clinical Trials 2021 Track.

Dataset information

Dataset irds.clinicaltrials.2021.documents

A snapshot of ClinicalTrials.gov from April 2021 for use with the TREC Clinical Trials 2021 Track.

Dataset information

Dataset irds.clinicaltrials.2021.trec-ct-2021.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Clinical Trials 2021 track.

Shared Task Website

Dataset irds.clinicaltrials.2021.trec-ct-2021.qrels

The TREC Clinical Trials 2021 track.

Shared Task Website

Dataset irds.clinicaltrials.2021.trec-ct-2021

The TREC Clinical Trials 2021 track.

Shared Task Website

Dataset irds.clinicaltrials.2021.trec-ct-2022.queries

The TREC Clinical Trials 2022 track.

Shared Task Website

ClueWeb09

ClueWeb 2009 web document collection. Contains over 1B web pages, in 10 languages.

The dataset is obtained for a fee from CMU, and is shipped as hard drives. More information is provided here.

Document collection site

Dataset irds.clueweb09.documents

ClueWeb 2009 web document collection. Contains over 1B web pages, in 10 languages.

The dataset is obtained for a fee from CMU, and is shipped as hard drives. More information is provided here.

Document collection site

Dataset irds.clueweb09.trec-mq-2009.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

TREC 2009 Million Query track.

Dataset irds.clueweb09.trec-mq-2009.qrels

TREC 2009 Million Query track.

Dataset irds.clueweb09.trec-mq-2009

TREC 2009 Million Query track.

clueweb09/ar

Subset of ClueWeb09 with only Arabic-language documents.

Dataset irds.clueweb09.ar.documents

Subset of ClueWeb09 with only Arabic-language documents.

clueweb09/catb

Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.

Dataset irds.clueweb09.catb.documents

Subset of ClueWeb09 with the first ~50 million English-language documents. Used as a smaller collection for TREC Web Track tasks.

Dataset irds.clueweb09.catb.trec-web-2009.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2009.qrels

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2009

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2009.diversity.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2009.diversity.qrels

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2009.diversity

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2010.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2010.qrels

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2010

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2010.diversity.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2010.diversity.qrels

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2010.diversity

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2011.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2011.qrels

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2011

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2011.diversity.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2011.diversity.qrels

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2011.diversity

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2012.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2012.qrels

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2012

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2012.diversity.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2012.diversity.qrels

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.catb.trec-web-2012.diversity

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

clueweb09/de

Subset of ClueWeb09 with only German-language documents.

Dataset irds.clueweb09.de.documents

Subset of ClueWeb09 with only German-language documents.

clueweb09/en

Subset of ClueWeb09 with only English-language documents.

Dataset irds.clueweb09.en.documents

Subset of ClueWeb09 with only English-language documents.

Dataset irds.clueweb09.en.trec-web-2009.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2009.qrels

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2009

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2009.diversity.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2009.diversity.qrels

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2009.diversity

The TREC Web Track 2009 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2010.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2010.qrels

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2010

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2010.diversity.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2010.diversity.qrels

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2010.diversity

The TREC Web Track 2010 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2011.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2011.qrels

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2011

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2011.diversity.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2011.diversity.qrels

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2011.diversity

The TREC Web Track 2011 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2012.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2012.qrels

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2012

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2012.diversity.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2012.diversity.qrels

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb09.en.trec-web-2012.diversity

The TREC Web Track 2012 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

clueweb09/es

Subset of ClueWeb09 with only Spanish-language documents.

Dataset irds.clueweb09.es.documents

Subset of ClueWeb09 with only Spanish-language documents.

clueweb09/fr

Subset of ClueWeb09 with only French-language documents.

Dataset irds.clueweb09.fr.documents

Subset of ClueWeb09 with only French-language documents.

clueweb09/it

Subset of ClueWeb09 with only Italian-language documents.

Dataset irds.clueweb09.it.documents

Subset of ClueWeb09 with only Italian-language documents.

clueweb09/ja

Subset of ClueWeb09 with only Japanese-language documents.

Dataset irds.clueweb09.ja.documents

Subset of ClueWeb09 with only Japanese-language documents.

clueweb09/ko

Subset of ClueWeb09 with only Korean-language documents.

Dataset irds.clueweb09.ko.documents

Subset of ClueWeb09 with only Korean-language documents.

clueweb09/pt

Subset of ClueWeb09 with only Portuguese-language documents.

Dataset irds.clueweb09.pt.documents

Subset of ClueWeb09 with only Portuguese-language documents.

clueweb09/zh

Subset of ClueWeb09 with only Chinese-language documents.

Dataset irds.clueweb09.zh.documents

Subset of ClueWeb09 with only Chinese-language documents.
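The per-language ClueWeb09 subsets in this section all follow one naming pattern, so their document entries can be enumerated rather than typed out. A minimal sketch, assuming only the ten language codes and the entry names that appear in this section; the constant and function names are hypothetical.

```python
from typing import Dict, Iterable

# Language codes of the ClueWeb09 subsets listed in this section.
CLUEWEB09_LANGUAGES = ("ar", "de", "en", "es", "fr", "it", "ja", "ko", "pt", "zh")


def language_document_ids(
    languages: Iterable[str] = CLUEWEB09_LANGUAGES,
) -> Dict[str, str]:
    """Build the dotted documents-entry name for each language subset."""
    return {lang: f"irds.clueweb09.{lang}.documents" for lang in languages}


if __name__ == "__main__":
    for lang, name in language_document_ids().items():
        print(lang, name)
```

The count of codes matches the "10 languages" stated for the full collection. The catb subset is not included because it is a size-based subset of English documents, not a language subset.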

ClueWeb12

ClueWeb 2012 web document collection. Contains 733M web pages.

The dataset is obtained for a fee from CMU, and is shipped as hard drives. More information is provided here.

Dataset irds.clueweb12.documents

ClueWeb 2012 web document collection. Contains 733M web pages.

The dataset is obtained for a fee from CMU, and is shipped as hard drives. More information is provided here.

Dataset irds.clueweb12.trec-web-2013.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2013 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb12.trec-web-2013.qrels

The TREC Web Track 2013 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb12.trec-web-2013

The TREC Web Track 2013 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb12.trec-web-2013.diversity.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2013 diverse ranking benchmark. Contains 50 queries with deep subtopic relevance judgments.

Dataset irds.clueweb12.trec-web-2013.diversity.qrels

The TREC Web Track 2013 diverse ranking benchmark. Contains 50 queries with deep subtopic relevance judgments.

Dataset irds.clueweb12.trec-web-2013.diversity

The TREC Web Track 2013 diverse ranking benchmark. Contains 50 queries with deep subtopic relevance judgments.

Dataset irds.clueweb12.trec-web-2014.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2014 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb12.trec-web-2014.qrels

The TREC Web Track 2014 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb12.trec-web-2014

The TREC Web Track 2014 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.clueweb12.trec-web-2014.diversity.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2014 diverse ranking benchmark. Contains 50 queries with deep subtopic relevance judgments.

Dataset irds.clueweb12.trec-web-2014.diversity.qrels

The TREC Web Track 2014 diverse ranking benchmark. Contains 50 queries with deep subtopic relevance judgments.

Dataset irds.clueweb12.trec-web-2014.diversity

The TREC Web Track 2014 diverse ranking benchmark. Contains 50 queries with deep subtopic relevance judgments.

Dataset irds.clueweb12.touche-2020-task-2.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Decision making processes, be it at the societal or at the personal level, eventually come to a point where one side will challenge the other with a why-question, which is a prompt to justify one's stance. Thus, technologies for argument mining and argumentation processing are maturing at a rapid pace, giving rise for the first time to argument retrieval. Touché 2020 is the first lab on Argument Retrieval at CLEF 2020 featuring two tasks.

Given a comparative question, retrieve and rank documents from the ClueWeb12 that help to answer the comparative question.

Documents are judged based on their general topical relevance.

Dataset irds.clueweb12.touche-2020-task-2.qrels

Decision making processes, be it at the societal or at the personal level, eventually come to a point where one side will challenge the other with a why-question, which is a prompt to justify one's stance. Thus, technologies for argument mining and argumentation processing are maturing at a rapid pace, giving rise for the first time to argument retrieval. Touché 2020 is the first lab on Argument Retrieval at CLEF 2020 featuring two tasks.

Given a comparative question, retrieve and rank documents from the ClueWeb12 that help to answer the comparative question.

Documents are judged based on their general topical relevance.

Dataset irds.clueweb12.touche-2020-task-2

Decision making processes, be it at the societal or at the personal level, eventually come to a point where one side will challenge the other with a why-question, which is a prompt to justify one's stance. Thus, technologies for argument mining and argumentation processing are maturing at a rapid pace, giving rise for the first time to argument retrieval. Touché 2020 is the first lab on Argument Retrieval at CLEF 2020 featuring two tasks.

Given a comparative question, retrieve and rank documents from the ClueWeb12 that help to answer the comparative question.

Documents are judged based on their general topical relevance.

Dataset irds.clueweb12.touche-2021-task-2.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Decision making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2021 is the second lab on argument retrieval at CLEF 2021, featuring two tasks.

Given a comparative question, retrieve and rank documents from the ClueWeb12 that help to answer the comparative question.

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

Dataset irds.clueweb12.touche-2021-task-2.qrels

Decision making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2021 is the second lab on argument retrieval at CLEF 2021, featuring two tasks.

Given a comparative question, retrieve and rank documents from the ClueWeb12 that help to answer the comparative question.

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

Dataset irds.clueweb12.touche-2021-task-2

Decision making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2021 is the second lab on argument retrieval at CLEF 2021, featuring two tasks.

Given a comparative question, retrieve and rank documents from the ClueWeb12 that help to answer the comparative question.

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

clueweb12/b13

Official subset of the ClueWeb12 datasets with 52M web pages.

Dataset irds.clueweb12.b13.documents

Official subset of the ClueWeb12 datasets with 52M web pages.

Dataset irds.clueweb12.b13.clef-ehealth.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The CLEF eHealth 2016-17 IR dataset. Contains consumer health queries and judgments containing trustworthiness and understandability scores, in addition to the normal relevance assessments.

This dataset contains the combined 2016 and 2017 relevance judgments, since the same queries were used in both years. The assessment year can be distinguished using iteration (2016 is iteration 0, 2017 is iteration 1).

Dataset irds.clueweb12.b13.clef-ehealth.qrels

The CLEF eHealth 2016-17 IR dataset. Contains consumer health queries and judgments containing trustworthiness and understandability scores, in addition to the normal relevance assessments.

This dataset contains the combined 2016 and 2017 relevance judgments, since the same queries were used in both years. The assessment year can be distinguished using iteration (2016 is iteration 0, 2017 is iteration 1).

Dataset irds.clueweb12.b13.clef-ehealth

The CLEF eHealth 2016-17 IR dataset. Contains consumer health queries and judgments containing trustworthiness and understandability scores, in addition to the normal relevance assessments.

This dataset contains the combined 2016 and 2017 relevance judgments, since the same queries were used in both years. The assessment year can be distinguished using iteration (2016 is iteration 0, 2017 is iteration 1).
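Because the 2016 and 2017 judgments are stored together, separating them means grouping by the iteration value. The following is a minimal sketch of that grouping; the EhealthJudgment record is a hypothetical stand-in for whatever record type the loader returns (the trustworthiness and understandability scores are omitted for brevity), and the sample rows are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class EhealthJudgment(NamedTuple):
    """Hypothetical judgment record; the field names are assumptions."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


# Mapping stated in the dataset description: iteration 0 is 2016, iteration 1 is 2017.
ITERATION_TO_YEAR = {"0": 2016, "1": 2017}


def split_by_year(qrels: Iterable[EhealthJudgment]) -> Dict[int, List[EhealthJudgment]]:
    """Group judgments by assessment year using their iteration value."""
    by_year: Dict[int, List[EhealthJudgment]] = defaultdict(list)
    for qrel in qrels:
        key = str(qrel.iteration)  # tolerate int or str iteration values
        if key not in ITERATION_TO_YEAR:
            raise ValueError(f"unexpected iteration value: {qrel.iteration!r}")
        by_year[ITERATION_TO_YEAR[key]].append(qrel)
    return dict(by_year)


# Made-up rows, for illustration only.
SAMPLE_QRELS = [
    EhealthJudgment("q1", "doc-a", 1, "0"),
    EhealthJudgment("q1", "doc-b", 0, "0"),
    EhealthJudgment("q1", "doc-a", 2, "1"),
]

if __name__ == "__main__":
    for year, rows in sorted(split_by_year(SAMPLE_QRELS).items()):
        print(year, len(rows))
```

Note that the same query and document pair can appear in both years, so any evaluation that needs a single label per pair has to pick one year or define its own merge rule.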

Dataset irds.clueweb12.b13.clef-ehealth.cs.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Czech. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.cs.qrels

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Czech. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.cs

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Czech. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.de.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to German. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.de.qrels

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to German. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.de

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to German. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.fr.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to French. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.fr.qrels

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to French. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.fr

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to French. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.hu.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Hungarian. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.hu.qrels

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Hungarian. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.hu

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Hungarian. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.pl.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Polish. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.pl.qrels

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Polish. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.pl

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Polish. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.sv.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Swedish. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.sv.qrels

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Swedish. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.clef-ehealth.sv

The CLEF eHealth 2016-17 IR dataset, with queries professionally translated to Swedish. See clueweb12/b13/clef-ehealth for more details.

Dataset irds.clueweb12.b13.ntcir-www-1.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The NTCIR-13 We Want Web (WWW) 1 ad-hoc ranking benchmark. Contains 100 queries with deep relevance judgments (avg 255 per query). Judgments aggregated from two assessors. Note that the qrels contain additional judgments from the NTCIR-14 CENTRE track.

Dataset irds.clueweb12.b13.ntcir-www-1.qrels

The NTCIR-13 We Want Web (WWW) 1 ad-hoc ranking benchmark. Contains 100 queries with deep relevance judgments (avg 255 per query). Judgments aggregated from two assessors. Note that the qrels contain additional judgments from the NTCIR-14 CENTRE track.

Dataset irds.clueweb12.b13.ntcir-www-1

The NTCIR-13 We Want Web (WWW) 1 ad-hoc ranking benchmark. Contains 100 queries with deep relevance judgments (avg 255 per query). Judgments aggregated from two assessors. Note that the qrels contain additional judgments from the NTCIR-14 CENTRE track.

Dataset irds.clueweb12.b13.ntcir-www-2.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The NTCIR-14 We Want Web (WWW) 2 ad-hoc ranking benchmark. Contains 80 queries with deep relevance judgments (avg 345 per query). Judgments aggregated from two assessors.

Dataset irds.clueweb12.b13.ntcir-www-2.qrels

The NTCIR-14 We Want Web (WWW) 2 ad-hoc ranking benchmark. Contains 80 queries with deep relevance judgments (avg 345 per query). Judgments aggregated from two assessors.

Dataset irds.clueweb12.b13.ntcir-www-2

The NTCIR-14 We Want Web (WWW) 2 ad-hoc ranking benchmark. Contains 80 queries with deep relevance judgments (avg 345 per query). Judgments aggregated from two assessors.

Dataset irds.clueweb12.b13.ntcir-www-3.queries

The NTCIR-15 We Want Web (WWW) 3 ad-hoc ranking benchmark. Contains 160 queries with deep relevance judgments (to be released). 80 of the queries are from clueweb12/b13/ntcir-www-2.

Shared task site
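Since half of the WWW-3 queries are carried over from clueweb12/b13/ntcir-www-2, it can be useful to separate the reused queries from the new ones. A minimal sketch using set operations on query identifiers follows; it assumes that reused queries keep the same identifier in both datasets, which this page does not state, and the identifiers shown are made up.

```python
from typing import Iterable, Set, Tuple


def partition_queries(
    www3_ids: Iterable[str], www2_ids: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Split WWW-3 query IDs into (reused from WWW-2, new in WWW-3).

    Assumption: a reused query carries the same ID in both datasets.
    """
    www3 = set(www3_ids)
    www2 = set(www2_ids)
    reused = www3 & www2   # present in both benchmarks
    new = www3 - www2      # only present in WWW-3
    return reused, new


# Made-up identifiers, for illustration only.
SAMPLE_WWW2_IDS = ["0001", "0002"]
SAMPLE_WWW3_IDS = ["0001", "0002", "0101", "0102"]

if __name__ == "__main__":
    reused, new = partition_queries(SAMPLE_WWW3_IDS, SAMPLE_WWW2_IDS)
    print("reused:", sorted(reused))
    print("new:", sorted(new))
```

With the real query sets, the description above implies that each of the two resulting sets should hold 80 identifiers if the assumption about shared identifiers is correct.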

Dataset irds.clueweb12.b13.trec-misinfo-2019.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Medical Misinformation 2019 dataset.

Dataset irds.clueweb12.b13.trec-misinfo-2019.qrels

The TREC Medical Misinformation 2019 dataset.

Dataset irds.clueweb12.b13.trec-misinfo-2019

The TREC Medical Misinformation 2019 dataset.

CODEC

CODEC Document Ranking sub-task.

Documents: curated web articles
Queries: challenging, entity-focused queries
Task Repository
See also: kilt/codec, the entity ranking subtask

Dataset irds.codec.documents

CODEC Document Ranking sub-task.

Documents: curated web articles
Queries: challenging, entity-focused queries
Task Repository
See also: kilt/codec, the entity ranking subtask

Dataset irds.codec.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

CODEC Document Ranking sub-task.

Documents: curated web articles
Queries: challenging, entity-focused queries
Task Repository
See also: kilt/codec, the entity ranking subtask

Dataset irds.codec.qrels

CODEC Document Ranking sub-task.

Documents: curated web articles
Queries: challenging, entity-focused queries
Task Repository
See also: kilt/codec, the entity ranking subtask

Dataset irds.codec

CODEC Document Ranking sub-task.

Documents: curated web articles
Queries: challenging, entity-focused queries
Task Repository
See also: kilt/codec, the entity ranking subtask

Dataset irds.codec.economics.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of codec that only contains topics about economics.

Dataset irds.codec.economics.qrels

Subset of codec that only contains topics about economics.

Dataset irds.codec.economics

Subset of codec that only contains topics about economics.

Dataset irds.codec.history.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of codec that only contains topics about history.

Dataset irds.codec.history.qrels

Subset of codec that only contains topics about history.

Dataset irds.codec.history

Subset of codec that only contains topics about history.

Dataset irds.codec.politics.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of codec that only contains topics about politics.

Dataset irds.codec.politics.qrels

Subset of codec that only contains topics about politics.

Dataset irds.codec.politics

Subset of codec that only contains topics about politics.

CORD-19

Collection of scientific articles related to COVID-19.

Uses the 2020-07-16 version of the dataset, corresponding to the "complete" collection used for TREC COVID.

Note that this version of the document collection only provides article meta-data. To get the full text, use cord19/fulltext.

Document collection site

Dataset irds.cord19.documents

Collection of scientific articles related to COVID-19.

Uses the 2020-07-16 version of the dataset, corresponding to the "complete" collection used for TREC COVID.

Note that this version of the document collection only provides article meta-data. To get the full text, use cord19/fulltext.

Document collection site

Dataset irds.cord19.trec-covid.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The Complete TREC COVID collection. Queries related to COVID-19, including deep relevance judgments.

Dataset irds.cord19.trec-covid.qrels

The Complete TREC COVID collection. Queries related to COVID-19, including deep relevance judgments.

Dataset irds.cord19.trec-covid

The Complete TREC COVID collection. Queries related to COVID-19, including deep relevance judgments.

Dataset irds.cord19.trec-covid.round5.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Round 5 of the TREC COVID task. Includes 50 queries related to COVID-19. This uses the "2020-07-16" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round5.qrels

Round 5 of the TREC COVID task. Includes 50 queries related to COVID-19. This uses the "2020-07-16" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round5

Round 5 of the TREC COVID task. Includes 50 queries related to COVID-19. This uses the "2020-07-16" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

cord19/fulltext

Version of the cord19 dataset that includes article full texts. This dataset takes longer to load than the version that only includes article meta-data.

Dataset irds.cord19.fulltext.documents

Version of the cord19 dataset that includes article full texts. This dataset takes longer to load than the version that only includes article meta-data.

Dataset irds.cord19.fulltext.trec-covid.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of the cord19/trec-covid dataset that includes article full texts. This dataset takes longer to load than the version that only includes article meta-data.

Queries and qrels are the same as cord19/trec-covid; it just uses the extended documents from cord19/fulltext.

Dataset irds.cord19.fulltext.trec-covid.qrels

Version of the cord19/trec-covid dataset that includes article full texts. This dataset takes longer to load than the version that only includes article meta-data.

Queries and qrels are the same as cord19/trec-covid; it just uses the extended documents from cord19/fulltext.

Dataset irds.cord19.fulltext.trec-covid

Version of the cord19/trec-covid dataset that includes article full texts. This dataset takes longer to load than the version that only includes article meta-data.

Queries and qrels are the same as cord19/trec-covid; it just uses the extended documents from cord19/fulltext.

cord19/trec-covid/round1

Round 1 of the TREC COVID task. Includes 30 queries related to COVID-19. This uses the "2020-04-10" version of the collection.

Dataset irds.cord19.trec-covid.round1.documents

Round 1 of the TREC COVID task. Includes 30 queries related to COVID-19. This uses the "2020-04-10" version of the collection.

Dataset irds.cord19.trec-covid.round1.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Round 1 of the TREC COVID task. Includes 30 queries related to COVID-19. This uses the "2020-04-10" version of the collection.

Dataset irds.cord19.trec-covid.round1.qrels

Round 1 of the TREC COVID task. Includes 30 queries related to COVID-19. This uses the "2020-04-10" version of the collection.

Dataset irds.cord19.trec-covid.round1

Round 1 of the TREC COVID task. Includes 30 queries related to COVID-19. This uses the "2020-04-10" version of the collection.

cord19/trec-covid/round2

Round 2 of the TREC COVID task. Includes 35 queries related to COVID-19. This uses the "2020-05-01" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round2.documents

Round 2 of the TREC COVID task. Includes 35 queries related to COVID-19. This uses the "2020-05-01" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round2.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Round 2 of the TREC COVID task. Includes 35 queries related to COVID-19. This uses the "2020-05-01" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round2.qrels

Round 2 of the TREC COVID task. Includes 35 queries related to COVID-19. This uses the "2020-05-01" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round2

Round 2 of the TREC COVID task. Includes 35 queries related to COVID-19. This uses the "2020-05-01" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

cord19/trec-covid/round3

Round 3 of the TREC COVID task. Includes 40 queries related to COVID-19. This uses the "2020-05-19" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round3.documents

Round 3 of the TREC COVID task. Includes 40 queries related to COVID-19. This uses the "2020-05-19" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round3.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Round 3 of the TREC COVID task. Includes 40 queries related to COVID-19. This uses the "2020-05-19" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round3.qrels

Round 3 of the TREC COVID task. Includes 40 queries related to COVID-19. This uses the "2020-05-19" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round3

Round 3 of the TREC COVID task. Includes 40 queries related to COVID-19. This uses the "2020-05-19" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

cord19/trec-covid/round4

Round 4 of the TREC COVID task. Includes 45 queries related to COVID-19. This uses the "2020-06-19" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round4.documents

Round 4 of the TREC COVID task. Includes 45 queries related to COVID-19. This uses the "2020-06-19" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round4.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Round 4 of the TREC COVID task. Includes 45 queries related to COVID-19. This uses the "2020-06-19" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round4.qrels

Round 4 of the TREC COVID task. Includes 45 queries related to COVID-19. This uses the "2020-06-19" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).

Dataset irds.cord19.trec-covid.round4

Round 4 of the TREC COVID task. Includes 45 queries related to COVID-19. This uses the "2020-06-19" version of the collection.

Note that the qrels do not contain results from the prior round(s). Use the "complete" version for this setting (cord19/trec-covid).
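The per-round descriptions can be summarised in a small lookup table. The sketch below copies the query counts and collection versions from this page; the Round type and the round_dataset_id helper are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Round(NamedTuple):
    queries: int      # number of queries in the round
    collection: str   # CORD-19 snapshot used by the round


# Values copied from the round descriptions on this page.
TREC_COVID_ROUNDS = {
    1: Round(30, "2020-04-10"),
    2: Round(35, "2020-05-01"),
    3: Round(40, "2020-05-19"),
    4: Round(45, "2020-06-19"),
    5: Round(50, "2020-07-16"),
}


def round_dataset_id(round_number: int) -> str:
    """Identifier of a single round, following the naming used in this listing."""
    if round_number not in TREC_COVID_ROUNDS:
        raise KeyError(f"no such TREC COVID round: {round_number}")
    return f"irds.cord19.trec-covid.round{round_number}"


for number, info in TREC_COVID_ROUNDS.items():
    print(round_dataset_id(number), info.queries, info.collection)
```

Because the qrels of rounds 2 to 5 exclude judgments from earlier rounds, cumulative evaluation should use irds.cord19.trec-covid rather than a single round.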

Cranfield

A small corpus of 1,400 scientific abstracts.

Documents: Scientific abstracts
Queries: Natural language questions
Dataset Information

Dataset irds.cranfield.documents

A small corpus of 1,400 scientific abstracts.

Documents: Scientific abstracts
Queries: Natural language questions
Dataset Information

Dataset irds.cranfield.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A small corpus of 1,400 scientific abstracts.

Documents: Scientific abstracts
Queries: Natural language questions
Dataset Information

Dataset irds.cranfield.qrels

A small corpus of 1,400 scientific abstracts.

Documents: Scientific abstracts
Queries: Natural language questions
Dataset Information

Dataset irds.cranfield

A small corpus of 1,400 scientific abstracts.

Documents: Scientific abstracts
Queries: Natural language questions
Dataset Information

CSL

The CSL dataset, used for the TREC NeuCLIR technical document task.

Dataset irds.csl.documents

The CSL dataset, used for the TREC NeuCLIR technical document task.

Dataset irds.csl.trec-2023.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC NeuCLIR 2023 technical document task.

Dataset irds.csl.trec-2023.qrels

The TREC NeuCLIR 2023 technical document task.

Dataset irds.csl.trec-2023

The TREC NeuCLIR 2023 technical document task.

disks45/nocr

A version of disks45 without the Congressional Record. This is the typical setting for tasks like TREC 7, TREC 8, and TREC Robust 2004.

Dataset irds.disks45.nocr.documents

A version of disks45 without the Congressional Record. This is the typical setting for tasks like TREC 7, TREC 8, and TREC Robust 2004.

Dataset irds.disks45.nocr.trec-robust-2004.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Robust retrieval task focuses on "improving the consistency of retrieval technology by focusing on poorly performing topics."

The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST. See details here.

Documents: News articles
Queries: keyword queries, descriptions, narratives
Relevance: Deep judgments
Task Overview Paper
See also: aquaint/trec-robust-2005

Dataset irds.disks45.nocr.trec-robust-2004.qrels

The TREC Robust retrieval task focuses on "improving the consistency of retrieval technology by focusing on poorly performing topics."

The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST. See details here.

Documents: News articles
Queries: keyword queries, descriptions, narratives
Relevance: Deep judgments
Task Overview Paper
See also: aquaint/trec-robust-2005

Dataset irds.disks45.nocr.trec-robust-2004

The TREC Robust retrieval task focuses on "improving the consistency of retrieval technology by focusing on poorly performing topics."

The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST. See details here.

Documents: News articles
Queries: keyword queries, descriptions, narratives
Relevance: Deep judgments
Task Overview Paper
See also: aquaint/trec-robust-2005

Dataset irds.disks45.nocr.trec-robust-2004.fold1.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Robust04 Fold 1 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold1.qrels

Robust04 Fold 1 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold1

Robust04 Fold 1 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold2.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Robust04 Fold 2 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold2.qrels

Robust04 Fold 2 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold2

Robust04 Fold 2 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold3.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Robust04 Fold 3 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold3.qrels

Robust04 Fold 3 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold3

Robust04 Fold 3 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold4.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Robust04 Fold 4 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold4.qrels

Robust04 Fold 4 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold4

Robust04 Fold 4 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold5.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Robust04 Fold 5 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold5.qrels

Robust04 Fold 5 (Title) proposed by Huston & Croft (2014) and used in numerous works

Dataset irds.disks45.nocr.trec-robust-2004.fold5

Robust04 Fold 5 (Title) proposed by Huston & Croft (2014) and used in numerous works
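One common way to use five folds is to rotate the held-out fold. The sketch below builds such a rotation from the fold identifiers listed above. It is a plain leave-one-fold-out scheme chosen for illustration; whether the works that use these folds also reserve a validation fold is not stated on this page. The function name cross_validation_splits is hypothetical.

```python
FOLD_PREFIX = "irds.disks45.nocr.trec-robust-2004.fold"
NUM_FOLDS = 5


def cross_validation_splits(num_folds: int = NUM_FOLDS):
    """Yield (train_ids, test_id) pairs, holding out one fold at a time."""
    for held_out in range(1, num_folds + 1):
        train_ids = [
            f"{FOLD_PREFIX}{fold}"
            for fold in range(1, num_folds + 1)
            if fold != held_out
        ]
        yield train_ids, f"{FOLD_PREFIX}{held_out}"


splits = list(cross_validation_splits())
for train_ids, test_id in splits:
    print("test:", test_id, "train:", len(train_ids), "folds")
```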

Dataset irds.disks45.nocr.trec7.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC 7 Adhoc Retrieval track.

Task Overview Paper

Dataset irds.disks45.nocr.trec7.qrels

The TREC 7 Adhoc Retrieval track.

Task Overview Paper

Dataset irds.disks45.nocr.trec7

The TREC 7 Adhoc Retrieval track.

Task Overview Paper

Dataset irds.disks45.nocr.trec8.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC 8 Adhoc Retrieval track.

Task Overview Paper

Dataset irds.disks45.nocr.trec8.qrels

The TREC 8 Adhoc Retrieval track.

Task Overview Paper

Dataset irds.disks45.nocr.trec8

The TREC 8 Adhoc Retrieval track.

Task Overview Paper

DPR Wiki100

A Wikipedia dump from 20 December, 2018, split into passages of 100 words. Used in experiments in the DPR paper (and other subsequent works) for retrieval experiments over Q&A collections.
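To make "passages of 100 words" concrete, the sketch below splits a text into consecutive, non-overlapping 100-word chunks. It assumes plain whitespace tokenization and keeps a shorter final passage as-is; the actual DPR preprocessing may differ in both respects, and the function name split_into_passages is hypothetical.

```python
from typing import List


def split_into_passages(text: str, words_per_passage: int = 100) -> List[str]:
    """Split text into consecutive, non-overlapping passages of at most
    `words_per_passage` whitespace-separated words."""
    if words_per_passage <= 0:
        raise ValueError("words_per_passage must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


article = " ".join(f"w{i}" for i in range(250))  # toy 250-word "article"
passages = split_into_passages(article)
print([len(p.split()) for p in passages])
```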

Dataset irds.dpr-w100.documents

A Wikipedia dump from 20 December, 2018, split into passages of 100 words. Used in experiments in the DPR paper (and other subsequent works) for retrieval experiments over Q&A collections.

Dataset irds.dpr-w100.natural-questions.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Dev subset from the Natural Questions Q&A collection. This differs from the natural-questions/dev dataset in that it uses the full Wikipedia dump and additional filtering (described in the DPR paper) was applied.

See also: natural-questions

Dataset irds.dpr-w100.natural-questions.dev.qrels

Dev subset from the Natural Questions Q&A collection. This differs from the natural-questions/dev dataset in that it uses the full Wikipedia dump and additional filtering (described in the DPR paper) was applied.

See also: natural-questions

Dataset irds.dpr-w100.natural-questions.dev

Dev subset from the Natural Questions Q&A collection. This differs from the natural-questions/dev dataset in that it uses the full Wikipedia dump and additional filtering (described in the DPR paper) was applied.

See also: natural-questions

Dataset irds.dpr-w100.natural-questions.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Training subset from the Natural Questions Q&A collection. This differs from the natural-questions/train dataset in that it uses the full Wikipedia dump and additional filtering (described in the DPR paper) was applied.

See also: natural-questions

Dataset irds.dpr-w100.natural-questions.train.qrels

Training subset from the Natural Questions Q&A collection. This differs from the natural-questions/train dataset in that it uses the full Wikipedia dump and additional filtering (described in the DPR paper) was applied.

See also: natural-questions

Dataset irds.dpr-w100.natural-questions.train

Training subset from the Natural Questions Q&A collection. This differs from the natural-questions/train dataset in that it uses the full Wikipedia dump and additional filtering (described in the DPR paper) was applied.

See also: natural-questions

Dataset irds.dpr-w100.trivia-qa.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Dev subset from the Trivia QA dataset. Differing from the official Trivia QA collection, this uses the DPR Wikipedia dump as the source collection. Refer to the DPR paper for more details.

Dataset irds.dpr-w100.trivia-qa.dev.qrels

Dev subset from the Trivia QA dataset. Differing from the official Trivia QA collection, this uses the DPR Wikipedia dump as the source collection. Refer to the DPR paper for more details.

Dataset irds.dpr-w100.trivia-qa.dev

Dev subset from the Trivia QA dataset. Differing from the official Trivia QA collection, this uses the DPR Wikipedia dump as the source collection. Refer to the DPR paper for more details.

Dataset irds.dpr-w100.trivia-qa.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Training subset from the Trivia QA dataset. Differing from the official Trivia QA collection, this uses the DPR Wikipedia dump as the source collection. Refer to the DPR paper for more details.

Dataset irds.dpr-w100.trivia-qa.train.qrels

Training subset from the Trivia QA dataset. Differing from the official Trivia QA collection, this uses the DPR Wikipedia dump as the source collection. Refer to the DPR paper for more details.

Dataset irds.dpr-w100.trivia-qa.train

Training subset from the Trivia QA dataset. Differing from the official Trivia QA collection, this uses the DPR Wikipedia dump as the source collection. Refer to the DPR paper for more details.

CodeSearchNet

A benchmark for semantic code search. Uses

Documents: Code functions in python, java, go, php, ruby, and javascript
Queries: Inferred from docstrings, or keyword queries (challenge set)
Dataset Paper
Challenge Task Leaderboard

Dataset irds.codesearchnet.documents

A benchmark for semantic code search. Uses

Documents: Code functions in python, java, go, php, ruby, and javascript
Queries: Inferred from docstrings, or keyword queries (challenge set)
Dataset Paper
Challenge Task Leaderboard

Dataset irds.codesearchnet.challenge.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official challenge set, with keyword queries and deep relevance assessments.

Dataset irds.codesearchnet.challenge.qrels

Official challenge set, with keyword queries and deep relevance assessments.

Dataset irds.codesearchnet.challenge

Official challenge set, with keyword queries and deep relevance assessments.

Dataset irds.codesearchnet.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official test set, using queries inferred from docstrings.

Dataset irds.codesearchnet.test.qrels

Official test set, using queries inferred from docstrings.

Dataset irds.codesearchnet.test

Official test set, using queries inferred from docstrings.

Dataset irds.codesearchnet.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official train set, using queries inferred from docstrings.

Dataset irds.codesearchnet.train.qrels

Official train set, using queries inferred from docstrings.

Dataset irds.codesearchnet.train

Official train set, using queries inferred from docstrings.

Dataset irds.codesearchnet.valid.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official validation set, using queries inferred from docstrings.

Dataset irds.codesearchnet.valid.qrels

Official validation set, using queries inferred from docstrings.

Dataset irds.codesearchnet.valid

Official validation set, using queries inferred from docstrings.

GOV

GOV web document collection. Used for early TREC Web Tracks. Not to be confused with gov2.

The dataset is obtained for a fee from UoG, and is shipped as a hard drive. More information is provided here.

Document collection site

Dataset irds.gov.documents

GOV web document collection. Used for early TREC Web Tracks. Not to be confused with gov2.

The dataset is obtained for a fee from UoG, and is shipped as a hard drive. More information is provided here.

Document collection site

Dataset irds.gov.trec-web-2002.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2002 ad-hoc ranking benchmark.

Dataset irds.gov.trec-web-2002.qrels

The TREC Web Track 2002 ad-hoc ranking benchmark.

Dataset irds.gov.trec-web-2002

The TREC Web Track 2002 ad-hoc ranking benchmark.

Dataset irds.gov.trec-web-2002.named-page.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2002 named page ranking benchmark.

Dataset irds.gov.trec-web-2002.named-page.qrels

The TREC Web Track 2002 named page ranking benchmark.

Dataset irds.gov.trec-web-2002.named-page

The TREC Web Track 2002 named page ranking benchmark.

Dataset irds.gov.trec-web-2003.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2003 ad-hoc ranking benchmark.

Dataset irds.gov.trec-web-2003.qrels

The TREC Web Track 2003 ad-hoc ranking benchmark.

Dataset irds.gov.trec-web-2003

The TREC Web Track 2003 ad-hoc ranking benchmark.

Dataset irds.gov.trec-web-2003.named-page.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2003 named page ranking benchmark.

Dataset irds.gov.trec-web-2003.named-page.qrels

The TREC Web Track 2003 named page ranking benchmark.

Dataset irds.gov.trec-web-2003.named-page

The TREC Web Track 2003 named page ranking benchmark.

Dataset irds.gov.trec-web-2004.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Web Track 2004 ad-hoc ranking benchmark.

Queries include a combination of topic distillation, homepage finding, and named page finding.

Dataset irds.gov.trec-web-2004.qrels

The TREC Web Track 2004 ad-hoc ranking benchmark.

Queries include a combination of topic distillation, homepage finding, and named page finding.

Dataset irds.gov.trec-web-2004

The TREC Web Track 2004 ad-hoc ranking benchmark.

Queries include a combination of topic distillation, homepage finding, and named page finding.

GOV2

GOV2 web document collection. Used for the TREC Terabyte Track.

The dataset is obtained for a fee from UoG, and is shipped as a hard drive. More information is provided here.

Document collection site

Dataset irds.gov2.documents

GOV2 web document collection. Used for the TREC Terabyte Track.

The dataset is obtained for a fee from UoG, and is shipped as a hard drive. More information is provided here.

Document collection site

Dataset irds.gov2.trec-mq-2007.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

TREC 2007 Million Query track.

Dataset irds.gov2.trec-mq-2007.qrels

TREC 2007 Million Query track.

Dataset irds.gov2.trec-mq-2007

TREC 2007 Million Query track.

Dataset irds.gov2.trec-mq-2008.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

TREC 2008 Million Query track.

Dataset irds.gov2.trec-mq-2008.qrels

TREC 2008 Million Query track.

Dataset irds.gov2.trec-mq-2008

TREC 2008 Million Query track.

Dataset irds.gov2.trec-tb-2004.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Terabyte Track 2004 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.gov2.trec-tb-2004.qrels

The TREC Terabyte Track 2004 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.gov2.trec-tb-2004

The TREC Terabyte Track 2004 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.gov2.trec-tb-2005.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Terabyte Track 2005 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.gov2.trec-tb-2005.qrels

The TREC Terabyte Track 2005 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.gov2.trec-tb-2005

The TREC Terabyte Track 2005 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.gov2.trec-tb-2005.efficiency.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Terabyte Track 2005 efficiency ranking benchmark. Contains 50,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2005. Only the 50 topics have judgments.

Dataset irds.gov2.trec-tb-2005.efficiency.qrels

The TREC Terabyte Track 2005 efficiency ranking benchmark. Contains 50,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2005. Only the 50 topics have judgments.

Dataset irds.gov2.trec-tb-2005.efficiency

The TREC Terabyte Track 2005 efficiency ranking benchmark. Contains 50,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2005. Only the 50 topics have judgments.

Dataset irds.gov2.trec-tb-2005.named-page.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Terabyte Track 2005 named page ranking benchmark. Contains 252 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.

Dataset irds.gov2.trec-tb-2005.named-page.qrels

The TREC Terabyte Track 2005 named page ranking benchmark. Contains 252 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.

Dataset irds.gov2.trec-tb-2005.named-page

The TREC Terabyte Track 2005 named page ranking benchmark. Contains 252 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.

Dataset irds.gov2.trec-tb-2006.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Terabyte Track 2006 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.gov2.trec-tb-2006.qrels

The TREC Terabyte Track 2006 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.gov2.trec-tb-2006

The TREC Terabyte Track 2006 ad-hoc ranking benchmark. Contains 50 queries with deep relevance judgments.

Dataset irds.gov2.trec-tb-2006.efficiency.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Terabyte Track 2006 efficiency ranking benchmark. Contains 100,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2006. Only the 50 topics have judgments.

Dataset irds.gov2.trec-tb-2006.efficiency.qrels

The TREC Terabyte Track 2006 efficiency ranking benchmark. Contains 100,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2006. Only the 50 topics have judgments.

Dataset irds.gov2.trec-tb-2006.efficiency

The TREC Terabyte Track 2006 efficiency ranking benchmark. Contains 100,000 queries from a search engine, including the 50 topics from gov2/trec-tb-2006. Only the 50 topics have judgments.
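Since only the 50 topics carry judgments, effectiveness can only be measured on the judged queries, while the full query log remains useful for timing. The sketch below shows one way to keep only queries that have at least one judgment. It uses toy in-memory data rather than the real collection, and judged_queries is a hypothetical helper.

```python
from typing import Dict, Iterable, Tuple


def judged_queries(
    queries: Dict[str, str],
    qrels: Iterable[Tuple[str, str, int]],
) -> Dict[str, str]:
    """Keep only the queries that have at least one relevance judgment.

    `queries` maps query id -> text; `qrels` yields
    (query_id, doc_id, relevance) triples.
    """
    judged_ids = {query_id for query_id, _doc_id, _relevance in qrels}
    return {qid: text for qid, text in queries.items() if qid in judged_ids}


# Toy stand-ins: three logged queries, of which only one was judged.
toy_queries = {"q1": "tax forms", "q2": "passport renewal", "q3": "weather"}
toy_qrels = [("q2", "doc-a", 1), ("q2", "doc-b", 0)]

evaluable = judged_queries(toy_queries, toy_qrels)
print(sorted(evaluable))
```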

Dataset irds.gov2.trec-tb-2006.efficiency.10k.queries

Small stream from gov2/trec-tb-2006/efficiency, with 10,000 queries.

Dataset irds.gov2.trec-tb-2006.efficiency.stream1.queries

Stream 1 of gov2/trec-tb-2006/efficiency (25,000 queries).

Dataset irds.gov2.trec-tb-2006.efficiency.stream2.queries

Stream 2 of gov2/trec-tb-2006/efficiency (25,000 queries).

Dataset irds.gov2.trec-tb-2006.efficiency.stream3.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Stream 3 of gov2/trec-tb-2006/efficiency (25,000 queries).

Dataset irds.gov2.trec-tb-2006.efficiency.stream3.qrels

Stream 3 of gov2/trec-tb-2006/efficiency (25,000 queries).

Dataset irds.gov2.trec-tb-2006.efficiency.stream3

Stream 3 of gov2/trec-tb-2006/efficiency (25,000 queries).

Dataset irds.gov2.trec-tb-2006.efficiency.stream4.queries

Stream 4 of gov2/trec-tb-2006/efficiency (25,000 queries).

Dataset irds.gov2.trec-tb-2006.named-page.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Terabyte Track 2006 named page ranking benchmark. Contains 181 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.

Dataset irds.gov2.trec-tb-2006.named-page.qrels

The TREC Terabyte Track 2006 named page ranking benchmark. Contains 181 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.

Dataset irds.gov2.trec-tb-2006.named-page

The TREC Terabyte Track 2006 named page ranking benchmark. Contains 181 queries with titles that resemble bookmark labels. Relevance judgments include near-duplicate pages and other pages that may satisfy the bookmark label.

Istella22

The Istella22 dataset facilitates comparisons between traditional and neural learning-to-rank by including query and document text along with LTR features (not included in ir_datasets).

Note that to use the dataset, you must read and accept the Istella22 License Agreement. By using the dataset, you agree to be bound by the terms of the license: the Istella dataset is solely for non-commercial use.

Dataset irds.istella22.documents

The Istella22 dataset facilitates comparisons between traditional and neural learning-to-rank by including query and document text along with LTR features (not included in ir_datasets).

Note that to use the dataset, you must read and accept the Istella22 License Agreement. By using the dataset, you agree to be bound by the terms of the license: the Istella dataset is solely for non-commercial use.

Dataset irds.istella22.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official test query set.

Dataset irds.istella22.test.qrels

Official test query set.

Dataset irds.istella22.test

Official test query set.

Dataset irds.istella22.test.fold1.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official test query set.

Dataset irds.istella22.test.fold1.qrels

Official test query set.

Dataset irds.istella22.test.fold1

Official test query set.

Dataset irds.istella22.test.fold2.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official test query set.

Dataset irds.istella22.test.fold2.qrels

Official test query set.

Dataset irds.istella22.test.fold2

Official test query set.

Dataset irds.istella22.test.fold3.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official test query set.

Dataset irds.istella22.test.fold3.qrels

Official test query set.

Dataset irds.istella22.test.fold3

Official test query set.

Dataset irds.istella22.test.fold4.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official test query set.

Dataset irds.istella22.test.fold4.qrels

Official test query set.

Dataset irds.istella22.test.fold4

Official test query set.

Dataset irds.istella22.test.fold5.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official test query set.

Dataset irds.istella22.test.fold5.qrels

Official test query set.

Dataset irds.istella22.test.fold5

Official test query set.

KILT

KILT is a corpus used for various "knowledge intensive language tasks".

Documents: Wikipedia articles
Repository
Paper
Leaderboard

Dataset irds.kilt.documents

KILT is a corpus used for various "knowledge intensive language tasks".

Documents: Wikipedia articles
Repository
Paper
Leaderboard

Dataset irds.kilt.codec.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

CODEC Entity Ranking sub-task.

Task Repository
See also: codec, the document ranking subtask

Dataset irds.kilt.codec.qrels

CODEC Entity Ranking sub-task.

Task Repository
See also: codec, the document ranking subtask

Dataset irds.kilt.codec

CODEC Entity Ranking sub-task.

Task Repository
See also: codec, the document ranking subtask

Dataset irds.kilt.codec.economics.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of codec that only contains topics about economics.

Dataset irds.kilt.codec.economics.qrels

Subset of codec that only contains topics about economics.

Dataset irds.kilt.codec.economics

Subset of codec that only contains topics about economics.

Dataset irds.kilt.codec.history.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of codec that only contains topics about history.

Dataset irds.kilt.codec.history.qrels

Subset of codec that only contains topics about history.

Dataset irds.kilt.codec.history

Subset of codec that only contains topics about history.

Dataset irds.kilt.codec.politics.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of codec that only contains topics about politics.

Dataset irds.kilt.codec.politics.qrels

Subset of codec that only contains topics about politics.

Dataset irds.kilt.codec.politics

Subset of codec that only contains topics about politics.

lotte/lifestyle/dev

Answers from lifestyle-focused forums, including bicycles, coffee, crafts, diy, gardening, lifehacks, mechanics, music, outdoors, parenting, pets, sports, and travel.

Dataset irds.lotte.lifestyle.dev.documents

Answers from lifestyle-focused forums, including bicycles, coffee, crafts, diy, gardening, lifehacks, mechanics, music, outdoors, parenting, pets, sports, and travel.
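The dataset identifiers on this page appear to mirror the slash-separated ir_datasets names (for instance, the heading lotte/lifestyle/dev next to irds.lotte.lifestyle.dev.documents). The sketch below splits an identifier under that assumption. The helper name split_irds_id and the COMPONENTS tuple are hypothetical and are not part of datamaestro_text or ir_datasets.

```python
from typing import Optional, Tuple

# Component suffixes seen on this page (assumed to be the complete set here).
COMPONENTS = ("documents", "queries", "qrels", "scoreddocs", "docpairs")


def split_irds_id(dataset_id: str) -> Tuple[str, Optional[str]]:
    """Split 'irds.a.b.component' into ('a/b', 'component').

    Assumes dots in the identifier correspond to slashes in the
    ir_datasets name; this is inferred from the page, not from the API.
    """
    prefix = "irds."
    if not dataset_id.startswith(prefix):
        raise ValueError(f"not an irds identifier: {dataset_id!r}")
    parts = dataset_id[len(prefix):].split(".")
    component = None
    # Only treat the last part as a component if something precedes it.
    if len(parts) > 1 and parts[-1] in COMPONENTS:
        component = parts.pop()
    return "/".join(parts), component


name, component = split_irds_id("irds.lotte.lifestyle.dev.documents")
print(name, component)
```

An identifier without a component suffix, such as irds.lotte.lifestyle.dev.forum, would map to the full name with a component of None.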

Dataset irds.lotte.lifestyle.dev.forum.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Forum queries for lotte/lifestyle/dev.

Dataset irds.lotte.lifestyle.dev.forum.qrels

Forum queries for lotte/lifestyle/dev.

Dataset irds.lotte.lifestyle.dev.forum

Forum queries for lotte/lifestyle/dev.

Dataset irds.lotte.lifestyle.dev.search.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Search queries for lotte/lifestyle/dev.

Dataset irds.lotte.lifestyle.dev.search.qrels

Search queries for lotte/lifestyle/dev.

Dataset irds.lotte.lifestyle.dev.search

Search queries for lotte/lifestyle/dev.

lotte/lifestyle/test

Queries and answers from lifestyle-focused forums, including bicycles, coffee, crafts, diy, gardening, lifehacks, mechanics, music, outdoors, parenting, pets, sports, and travel.

Dataset irds.lotte.lifestyle.test.documents

Queries and answers from lifestyle-focused forums, including bicycles, coffee, crafts, diy, gardening, lifehacks, mechanics, music, outdoors, parenting, pets, sports, and travel.

Dataset irds.lotte.lifestyle.test.forum.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Forum queries for lotte/lifestyle/test.

Dataset irds.lotte.lifestyle.test.forum.qrels

Forum queries for lotte/lifestyle/test.

Dataset irds.lotte.lifestyle.test.forum

Forum queries for lotte/lifestyle/test.

Dataset irds.lotte.lifestyle.test.search.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Search queries for lotte/lifestyle/test.

Dataset irds.lotte.lifestyle.test.search.qrels

Search queries for lotte/lifestyle/test.

Dataset irds.lotte.lifestyle.test.search

Search queries for lotte/lifestyle/test.

lotte/pooled/dev

Combined version of lotte/lifestyle/dev, lotte/recreation/dev, lotte/science/dev, lotte/technology/dev, and lotte/writing/dev.

Dataset irds.lotte.pooled.dev.documents

Combined version of lotte/lifestyle/dev, lotte/recreation/dev, lotte/science/dev, lotte/technology/dev, and lotte/writing/dev.

Dataset irds.lotte.pooled.dev.forum.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Forum queries for lotte/pooled/dev.

Dataset irds.lotte.pooled.dev.forum.qrels

Forum queries for lotte/pooled/dev.

Dataset irds.lotte.pooled.dev.forum

Forum queries for lotte/pooled/dev.

Dataset irds.lotte.pooled.dev.search.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Search queries for lotte/pooled/dev.

Dataset irds.lotte.pooled.dev.search.qrels

Search queries for lotte/pooled/dev.

Dataset irds.lotte.pooled.dev.search

Search queries for lotte/pooled/dev.

lotte/pooled/test

Combined version of lotte/lifestyle/test, lotte/recreation/test, lotte/science/test, lotte/technology/test, and lotte/writing/test.

Dataset irds.lotte.pooled.test.documents

Combined version of lotte/lifestyle/test, lotte/recreation/test, lotte/science/test, lotte/technology/test, and lotte/writing/test.

Dataset irds.lotte.pooled.test.forum.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Forum queries for lotte/pooled/test.

Dataset irds.lotte.pooled.test.forum.qrels

Forum queries for lotte/pooled/test.

Dataset irds.lotte.pooled.test.forum

Forum queries for lotte/pooled/test.

Dataset irds.lotte.pooled.test.search.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Search queries for lotte/pooled/test.

Dataset irds.lotte.pooled.test.search.qrels

Search queries for lotte/pooled/test.

Dataset irds.lotte.pooled.test.search

Search queries for lotte/pooled/test.

lotte/recreation/dev

Answers from recreation-focused forums, including anime, boardgames, gaming, movies, photo, rpg, and scifi.

Dataset irds.lotte.recreation.dev.documents

Answers from recreation-focused forums, including anime, boardgames, gaming, movies, photo, rpg, and scifi.

Dataset irds.lotte.recreation.dev.forum.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Forum queries for lotte/recreation/dev.

Dataset irds.lotte.recreation.dev.forum.qrels

Forum queries for lotte/recreation/dev.

Dataset irds.lotte.recreation.dev.forum

Forum queries for lotte/recreation/dev.

Dataset irds.lotte.recreation.dev.search.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Search queries for lotte/recreation/dev.

Dataset irds.lotte.recreation.dev.search.qrels

Search queries for lotte/recreation/dev.

Dataset irds.lotte.recreation.dev.search

Search queries for lotte/recreation/dev.

lotte/recreation/test

Answers from recreation-focused forums, including anime, boardgames, gaming, movies, photo, rpg, and scifi.

Dataset irds.lotte.recreation.test.documents

Answers from recreation-focused forums, including anime, boardgames, gaming, movies, photo, rpg, and scifi.

Dataset irds.lotte.recreation.test.forum.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Forum queries for lotte/recreation/test.

Dataset irds.lotte.recreation.test.forum.qrels

Forum queries for lotte/recreation/test.

Dataset irds.lotte.recreation.test.forum

Forum queries for lotte/recreation/test.

Dataset irds.lotte.recreation.test.search.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Search queries for lotte/recreation/test.

Dataset irds.lotte.recreation.test.search.qrels

Search queries for lotte/recreation/test.

Dataset irds.lotte.recreation.test.search

Search queries for lotte/recreation/test.

lotte/science/dev

Answers from science-focused forums, including academia, astronomy, biology, chemistry, datascience, earthscience, engineering, math, philosophy, physics, and stats.

Dataset irds.lotte.science.dev.documents

Answers from science-focused forums, including academia, astronomy, biology, chemistry, datascience, earthscience, engineering, math, philosophy, physics, and stats.

Dataset irds.lotte.science.dev.forum.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Forum queries for lotte/science/dev.

Dataset irds.lotte.science.dev.forum.qrels

Forum queries for lotte/science/dev.

Dataset irds.lotte.science.dev.forum

Forum queries for lotte/science/dev.

Dataset irds.lotte.science.dev.search.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Search queries for lotte/science/dev.

Dataset irds.lotte.science.dev.search.qrels

Search queries for lotte/science/dev.

Dataset irds.lotte.science.dev.search

Search queries for lotte/science/dev.

lotte/science/test

Answers from science-focused forums, including academia, astronomy, biology, chemistry, datascience, earthscience, engineering, math, philosophy, physics, and stats.

Dataset irds.lotte.science.test.documents

Answers from science-focused forums, including academia, astronomy, biology, chemistry, datascience, earthscience, engineering, math, philosophy, physics, and stats.

Dataset irds.lotte.science.test.forum.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Forum queries for lotte/science/test.

Dataset irds.lotte.science.test.forum.qrels

Forum queries for lotte/science/test.

Dataset irds.lotte.science.test.forum

Forum queries for lotte/science/test.

Dataset irds.lotte.science.test.search.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Search queries for lotte/science/test.

Dataset irds.lotte.science.test.search.qrels

Search queries for lotte/science/test.

Dataset irds.lotte.science.test.search

Search queries for lotte/science/test.

lotte/technology/dev

Answers from technology-focused forums, including android, apple, askubuntu, electronics, networkengineering, security, serverfault, softwareengineering, superuser, unix, and webapps.

Dataset irds.lotte.technology.dev.documents

Answers from technology-focused forums, including android, apple, askubuntu, electronics, networkengineering, security, serverfault, softwareengineering, superuser, unix, and webapps.

Dataset irds.lotte.technology.dev.forum.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Forum queries for lotte/technology/dev.

Dataset irds.lotte.technology.dev.forum.qrels

Forum queries for lotte/technology/dev.

Dataset irds.lotte.technology.dev.forum

Forum queries for lotte/technology/dev.

Dataset irds.lotte.technology.dev.search.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Search queries for lotte/technology/dev.

Dataset irds.lotte.technology.dev.search.qrels

Search queries for lotte/technology/dev.

Dataset irds.lotte.technology.dev.search

Search queries for lotte/technology/dev.

lotte/technology/test

Answers from technology-focused forums, including android, apple, askubuntu, electronics, networkengineering, security, serverfault, softwareengineering, superuser, unix, and webapps.

Dataset irds.lotte.technology.test.documents

Answers from technology-focused forums, including android, apple, askubuntu, electronics, networkengineering, security, serverfault, softwareengineering, superuser, unix, and webapps.

Dataset irds.lotte.technology.test.forum.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Forum queries for lotte/technology/test.

Dataset irds.lotte.technology.test.forum.qrels

Forum queries for lotte/technology/test.

Dataset irds.lotte.technology.test.forum

Forum queries for lotte/technology/test.

Dataset irds.lotte.technology.test.search.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Search queries for lotte/technology/test.

Dataset irds.lotte.technology.test.search.qrels

Search queries for lotte/technology/test.

Dataset irds.lotte.technology.test.search

Search queries for lotte/technology/test.

lotte/writing/dev

Answers from writing-focused forums, including ell, english, linguistics, literature, worldbuilding, and writing.

Dataset irds.lotte.writing.dev.documents

Answers from writing-focused forums, including ell, english, linguistics, literature, worldbuilding, and writing.

Dataset irds.lotte.writing.dev.forum.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Forum queries for lotte/writing/dev.

Dataset irds.lotte.writing.dev.forum.qrels

Forum queries for lotte/writing/dev.

Dataset irds.lotte.writing.dev.forum

Forum queries for lotte/writing/dev.

Dataset irds.lotte.writing.dev.search.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Search queries for lotte/writing/dev.

Dataset irds.lotte.writing.dev.search.qrels

Search queries for lotte/writing/dev.

Dataset irds.lotte.writing.dev.search

Search queries for lotte/writing/dev.

lotte/writing/test

Answers from writing-focused forums, including ell, english, linguistics, literature, worldbuilding, and writing.

Dataset irds.lotte.writing.test.documents

Answers from writing-focused forums, including ell, english, linguistics, literature, worldbuilding, and writing.

Dataset irds.lotte.writing.test.forum.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Forum queries for lotte/writing/test.

Dataset irds.lotte.writing.test.forum.qrels

Forum queries for lotte/writing/test.

Dataset irds.lotte.writing.test.forum

Forum queries for lotte/writing/test.

Dataset irds.lotte.writing.test.search.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Search queries for lotte/writing/test.

Dataset irds.lotte.writing.test.search.qrels

Search queries for lotte/writing/test.

Dataset irds.lotte.writing.test.search

Search queries for lotte/writing/test.

miracl/ar

The Arabic corpus.

Dataset irds.miracl.ar.documents

The Arabic corpus.

Dataset irds.miracl.ar.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Arabic.

Dataset irds.miracl.ar.dev.qrels

The dev set for Arabic.

Dataset irds.miracl.ar.dev

The dev set for Arabic.

Dataset irds.miracl.ar.test-a.queries

The held-out test set (version a) for Arabic.

Dataset irds.miracl.ar.test-b.queries

The held-out test set (version b) for Arabic.

Dataset irds.miracl.ar.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Arabic.

Dataset irds.miracl.ar.train.qrels

The train set for Arabic.

Dataset irds.miracl.ar.train

The train set for Arabic.

miracl/bn

The Bengali corpus.

Dataset irds.miracl.bn.documents

The Bengali corpus.

Dataset irds.miracl.bn.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Bengali.

Dataset irds.miracl.bn.dev.qrels

The dev set for Bengali.

Dataset irds.miracl.bn.dev

The dev set for Bengali.

Dataset irds.miracl.bn.test-a.queries

The held-out test set (version a) for Bengali.

Dataset irds.miracl.bn.test-b.queries

The held-out test set (version b) for Bengali.

Dataset irds.miracl.bn.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Bengali.

Dataset irds.miracl.bn.train.qrels

The train set for Bengali.

Dataset irds.miracl.bn.train

The train set for Bengali.

miracl/de

The German corpus.

Dataset irds.miracl.de.documents

The German corpus.

Dataset irds.miracl.de.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for German.

Dataset irds.miracl.de.dev.qrels

The dev set for German.

Dataset irds.miracl.de.dev

The dev set for German.

Dataset irds.miracl.de.test-b.queries

The held-out test set (version b) for German.

miracl/en

The English corpus.

Dataset irds.miracl.en.documents

The English corpus.

Dataset irds.miracl.en.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for English.

Dataset irds.miracl.en.dev.qrels

The dev set for English.

Dataset irds.miracl.en.dev

The dev set for English.

Dataset irds.miracl.en.test-a.queries

The held-out test set (version a) for English.

Dataset irds.miracl.en.test-b.queries

The held-out test set (version b) for English.

Dataset irds.miracl.en.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for English.

Dataset irds.miracl.en.train.qrels

The train set for English.

Dataset irds.miracl.en.train

The train set for English.

miracl/es

The Spanish corpus.

Dataset irds.miracl.es.documents

The Spanish corpus.

Dataset irds.miracl.es.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Spanish.

Dataset irds.miracl.es.dev.qrels

The dev set for Spanish.

Dataset irds.miracl.es.dev

The dev set for Spanish.

Dataset irds.miracl.es.test-b.queries

The held-out test set (version b) for Spanish.

Dataset irds.miracl.es.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Spanish.

Dataset irds.miracl.es.train.qrels

The train set for Spanish.

Dataset irds.miracl.es.train

The train set for Spanish.

miracl/fa

The Persian corpus.

Dataset irds.miracl.fa.documents

The Persian corpus.

Dataset irds.miracl.fa.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Persian.

Dataset irds.miracl.fa.dev.qrels

The dev set for Persian.

Dataset irds.miracl.fa.dev

The dev set for Persian.

Dataset irds.miracl.fa.test-b.queries

The held-out test set (version b) for Persian.

Dataset irds.miracl.fa.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Persian.

Dataset irds.miracl.fa.train.qrels

The train set for Persian.

Dataset irds.miracl.fa.train

The train set for Persian.

miracl/fi

The Finnish corpus.

Dataset irds.miracl.fi.documents

The Finnish corpus.

Dataset irds.miracl.fi.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Finnish.

Dataset irds.miracl.fi.dev.qrels

The dev set for Finnish.

Dataset irds.miracl.fi.dev

The dev set for Finnish.

Dataset irds.miracl.fi.test-a.queries

The held-out test set (version a) for Finnish.

Dataset irds.miracl.fi.test-b.queries

The held-out test set (version b) for Finnish.

Dataset irds.miracl.fi.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Finnish.

Dataset irds.miracl.fi.train.qrels

The train set for Finnish.

Dataset irds.miracl.fi.train

The train set for Finnish.

miracl/fr

The French corpus.

Dataset irds.miracl.fr.documents

The French corpus.

Dataset irds.miracl.fr.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for French.

Dataset irds.miracl.fr.dev.qrels

The dev set for French.

Dataset irds.miracl.fr.dev

The dev set for French.

Dataset irds.miracl.fr.test-b.queries

The held-out test set (version b) for French.

Dataset irds.miracl.fr.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for French.

Dataset irds.miracl.fr.train.qrels

The train set for French.

Dataset irds.miracl.fr.train

The train set for French.

miracl/hi

The Hindi corpus.

Dataset irds.miracl.hi.documents

The Hindi corpus.

Dataset irds.miracl.hi.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Hindi.

Dataset irds.miracl.hi.dev.qrels

The dev set for Hindi.

Dataset irds.miracl.hi.dev

The dev set for Hindi.

Dataset irds.miracl.hi.test-b.queries

The held-out test set (version b) for Hindi.

Dataset irds.miracl.hi.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Hindi.

Dataset irds.miracl.hi.train.qrels

The train set for Hindi.

Dataset irds.miracl.hi.train

The train set for Hindi.

miracl/id

The Indonesian corpus.

Dataset irds.miracl.id.documents

The Indonesian corpus.

Dataset irds.miracl.id.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Indonesian.

Dataset irds.miracl.id.dev.qrels

The dev set for Indonesian.

Dataset irds.miracl.id.dev

The dev set for Indonesian.

Dataset irds.miracl.id.test-a.queries

The held-out test set (version a) for Indonesian.

Dataset irds.miracl.id.test-b.queries

The held-out test set (version b) for Indonesian.

Dataset irds.miracl.id.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Indonesian.

Dataset irds.miracl.id.train.qrels

The train set for Indonesian.

Dataset irds.miracl.id.train

The train set for Indonesian.

miracl/ja

The Japanese corpus.

Dataset irds.miracl.ja.documents

The Japanese corpus.

Dataset irds.miracl.ja.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Japanese.

Dataset irds.miracl.ja.dev.qrels

The dev set for Japanese.

Dataset irds.miracl.ja.dev

The dev set for Japanese.

Dataset irds.miracl.ja.test-a.queries

The held-out test set (version a) for Japanese.

Dataset irds.miracl.ja.test-b.queries

The held-out test set (version b) for Japanese.

Dataset irds.miracl.ja.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Japanese.

Dataset irds.miracl.ja.train.qrels

The train set for Japanese.

Dataset irds.miracl.ja.train

The train set for Japanese.

miracl/ko

The Korean corpus.

Dataset irds.miracl.ko.documents

The Korean corpus.

Dataset irds.miracl.ko.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Korean.

Dataset irds.miracl.ko.dev.qrels

The dev set for Korean.

Dataset irds.miracl.ko.dev

The dev set for Korean.

Dataset irds.miracl.ko.test-a.queries

The held-out test set (version a) for Korean.

Dataset irds.miracl.ko.test-b.queries

The held-out test set (version b) for Korean.

Dataset irds.miracl.ko.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Korean.

Dataset irds.miracl.ko.train.qrels

The train set for Korean.

Dataset irds.miracl.ko.train

The train set for Korean.

miracl/ru

The Russian corpus.

Dataset irds.miracl.ru.documents

The Russian corpus.

Dataset irds.miracl.ru.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Russian.

Dataset irds.miracl.ru.dev.qrels

The dev set for Russian.

Dataset irds.miracl.ru.dev

The dev set for Russian.

Dataset irds.miracl.ru.test-a.queries

The held-out test set (version a) for Russian.

Dataset irds.miracl.ru.test-b.queries

The held-out test set (version b) for Russian.

Dataset irds.miracl.ru.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Russian.

Dataset irds.miracl.ru.train.qrels

The train set for Russian.

Dataset irds.miracl.ru.train

The train set for Russian.

miracl/sw

The Swahili corpus.

Dataset irds.miracl.sw.documents

The Swahili corpus.

Dataset irds.miracl.sw.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Swahili.

Dataset irds.miracl.sw.dev.qrels

The dev set for Swahili.

Dataset irds.miracl.sw.dev

The dev set for Swahili.

Dataset irds.miracl.sw.test-a.queries

The held-out test set (version a) for Swahili.

Dataset irds.miracl.sw.test-b.queries

The held-out test set (version b) for Swahili.

Dataset irds.miracl.sw.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Swahili.

Dataset irds.miracl.sw.train.qrels

The train set for Swahili.

Dataset irds.miracl.sw.train

The train set for Swahili.

miracl/te

The Telugu corpus.

Dataset irds.miracl.te.documents

The Telugu corpus.

Dataset irds.miracl.te.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Telugu.

Dataset irds.miracl.te.dev.qrels

The dev set for Telugu.

Dataset irds.miracl.te.dev

The dev set for Telugu.

Dataset irds.miracl.te.test-a.queries

The held-out test set (version a) for Telugu.

Dataset irds.miracl.te.test-b.queries

The held-out test set (version b) for Telugu.

Dataset irds.miracl.te.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Telugu.

Dataset irds.miracl.te.train.qrels

The train set for Telugu.

Dataset irds.miracl.te.train

The train set for Telugu.

miracl/th

The Thai corpus.

Dataset irds.miracl.th.documents

The Thai corpus.

Dataset irds.miracl.th.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Thai.

Dataset irds.miracl.th.dev.qrels

The dev set for Thai.

Dataset irds.miracl.th.dev

The dev set for Thai.

Dataset irds.miracl.th.test-a.queries

The held-out test set (version a) for Thai.

Dataset irds.miracl.th.test-b.queries

The held-out test set (version b) for Thai.

Dataset irds.miracl.th.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Thai.

Dataset irds.miracl.th.train.qrels

The train set for Thai.

Dataset irds.miracl.th.train

The train set for Thai.

miracl/yo

The Yoruba corpus.

Dataset irds.miracl.yo.documents

The Yoruba corpus.

Dataset irds.miracl.yo.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Yoruba.

Dataset irds.miracl.yo.dev.qrels

The dev set for Yoruba.

Dataset irds.miracl.yo.dev

The dev set for Yoruba.

Dataset irds.miracl.yo.test-b.queries

The held-out test set (version b) for Yoruba.

miracl/zh

The Chinese corpus.

Dataset irds.miracl.zh.documents

The Chinese corpus.

Dataset irds.miracl.zh.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The dev set for Chinese.

Dataset irds.miracl.zh.dev.qrels

The dev set for Chinese.

Dataset irds.miracl.zh.dev

The dev set for Chinese.

Dataset irds.miracl.zh.test-b.queries

The held-out test set (version b) for Chinese.

Dataset irds.miracl.zh.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The train set for Chinese.

Dataset irds.miracl.zh.train.qrels

The train set for Chinese.

Dataset irds.miracl.zh.train

The train set for Chinese.

MSMARCO (passage)

A passage ranking benchmark with a collection of 8.8 million passages and question queries. Most relevance judgments are shallow (typically at most 1-2 per query), but the TREC Deep Learning track adds deep judgments. Evaluation is typically conducted using MRR@10.
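As a rough illustration of the MRR@10 metric mentioned above, the sketch below averages the reciprocal rank of the first relevant passage found within the top 10 results of each query. The function name mrr_at_k and the toy IDs are assumptions made for this example; this is not the official evaluation script.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        targets = relevant.get(query_id, set())
        for rank, doc_id in enumerate(ranked_ids[:k], start=1):
            if doc_id in targets:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    # Queries with no relevant hit in the top k contribute 0.
    return total / len(rankings)


# Toy data with hypothetical IDs (not taken from the collection).
rankings = {"q1": ["d3", "d1", "d7"], "q2": ["d9", "d4"], "q3": ["d5"]}
relevant = {"q1": {"d1"}, "q2": {"d9"}, "q3": {"d8"}}
score = mrr_at_k(rankings, relevant)
print(score)
```

Here q1 finds its relevant passage at rank 2, q2 at rank 1, and q3 not at all, so the mean is (0.5 + 1 + 0) / 3.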

Note that the original document source files for this collection contain a double-encoding error that causes strange sequences like "å¬" and "ðºð". These are automatically corrected (properly converting the previous examples to "公" and "🇺🇸").

See also: msmarco-document
Documents: Short passages (from web)
Queries: Natural language questions (from query log)
Leaderboard
Dataset Paper
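The double-encoding error described in the note above can be reproduced and reversed in a few lines. The sketch below assumes the error comes from UTF-8 bytes being misread as Latin-1; the correction actually applied by ir_datasets may differ in its details, and both function names are hypothetical.

```python
def simulate_double_encoding(text: str) -> str:
    """Reproduce the error: UTF-8 bytes misread as Latin-1 characters."""
    return text.encode("utf-8").decode("latin-1")


def repair_double_encoding(text: str) -> str:
    """Undo the error; return the input unchanged if it is not mojibake."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text that was never double-encoded is left as-is.
        return text


garbled = simulate_double_encoding("公")
repaired = repair_double_encoding(garbled)
print(repr(garbled), repaired)
```

The try/except guard matters because applying the round trip blindly would raise an error on text that is already correct.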

Dataset irds.msmarco-passage.documents

A passage ranking benchmark with a collection of 8.8 million passages and question queries. Most relevance judgments are shallow (typically at most 1-2 per query), but the TREC Deep Learning track adds deep judgments. Evaluation is typically conducted using MRR@10.

Note that the original document source files for this collection contain a double-encoding error that causes strange sequences like "å¬" and "ðºð". These are automatically corrected (properly converting the previous examples to "公" and "🇺🇸").

See also: msmarco-document
Documents: Short passages (from web)
Queries: Natural language questions (from query log)
Leaderboard
Dataset Paper

Dataset irds.msmarco-passage.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official dev set.

scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available dev queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
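Because the provided scoreddocs carry a placeholder score of 0, a re-ranking pipeline has to supply its own scores. The sketch below shows one way this could look on toy data. The tuple layout, the rerank helper, and the toy_scores lookup are all assumptions for illustration, not the datamaestro_text or ir_datasets API.

```python
from typing import Callable, Dict, List, Tuple

ScoredDoc = Tuple[str, str, float]  # (query_id, doc_id, score)


def rerank(candidates: List[ScoredDoc],
           score_fn: Callable[[str, str], float]
           ) -> Dict[str, List[Tuple[str, float]]]:
    """Re-score each candidate and sort per query, best first."""
    per_query: Dict[str, List[Tuple[str, float]]] = {}
    for query_id, doc_id, _placeholder in candidates:
        # The incoming score is ignored: it is 0 for every candidate.
        new_score = score_fn(query_id, doc_id)
        per_query.setdefault(query_id, []).append((doc_id, new_score))
    for ranked in per_query.values():
        ranked.sort(key=lambda pair: pair[1], reverse=True)
    return per_query


# Toy candidates and a hypothetical scorer backed by a lookup table.
candidates = [("q1", "d1", 0.0), ("q1", "d2", 0.0), ("q2", "d3", 0.0)]
toy_scores = {("q1", "d1"): 0.2, ("q1", "d2"): 0.9, ("q2", "d3"): 0.5}
result = rerank(candidates, lambda q, d: toy_scores[(q, d)])
print(result)
```

In practice score_fn would wrap whatever model is being evaluated; the candidate set itself stays fixed, which is what distinguishes re-ranking from full retrieval.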

Dataset irds.msmarco-passage.dev.qrels

Official dev set.

scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available dev queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).

Dataset irds.msmarco-passage.dev

Official dev set.

scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available dev queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).

Dataset irds.msmarco-passage.dev.2.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

"Dev2" split of the msmarco-passage/dev set. Originally released as part of the v2 corpus.

Dataset irds.msmarco-passage.dev.2.qrels

"Dev2" split of the msmarco-passage/dev set. Originally released as part of the v2 corpus.

Dataset irds.msmarco-passage.dev.2

"Dev2" split of the msmarco-passage/dev set. Originally released as part of the v2 corpus.

Dataset irds.msmarco-passage.dev.judged.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of msmarco-passage/dev that only includes queries that have at least one qrel.
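The "judged" filter described above amounts to keeping only the queries whose ID occurs in at least one relevance judgment. A minimal sketch on toy data follows; the judged_only helper and the tuple layout of the qrels are assumptions for this example.

```python
from typing import Dict, Iterable, Tuple

Qrel = Tuple[str, str, int]  # (query_id, doc_id, relevance)


def judged_only(queries: Dict[str, str],
                qrels: Iterable[Qrel]) -> Dict[str, str]:
    """Keep the queries that have at least one relevance judgment."""
    judged_ids = {query_id for query_id, _doc_id, _relevance in qrels}
    return {qid: text for qid, text in queries.items() if qid in judged_ids}


# Toy data with hypothetical IDs and texts.
queries = {"q1": "what is bm25", "q2": "define mrr", "q3": "unjudged query"}
qrels = [("q1", "d10", 1), ("q2", "d42", 1), ("q2", "d43", 1)]
judged = judged_only(queries, qrels)
print(sorted(judged))
```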

Dataset irds.msmarco-passage.dev.judged.qrels

Subset of msmarco-passage/dev that only includes queries that have at least one qrel.

Dataset irds.msmarco-passage.dev.judged

Subset of msmarco-passage/dev that only includes queries that have at least one qrel.

Dataset irds.msmarco-passage.dev.small.queries

Official "small" version of the dev set, consisting of 6,980 queries (6.9% of the full dev set).

Dataset irds.msmarco-passage.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official "small" version of the dev set, consisting of 6,980 queries (6.9% of the full dev set).

Dataset irds.msmarco-passage.dev.small.qrels

Official "small" version of the dev set, consisting of 6,980 queries (6.9% of the full dev set).

Dataset irds.msmarco-passage.dev.small

Official "small" version of the dev set, consisting of 6,980 queries (6.9% of the full dev set).

Dataset irds.msmarco-passage.eval.queries

Official eval set for submission to MS MARCO leaderboard. Relevance judgments are hidden.

scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available eval queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).

Dataset irds.msmarco-passage.eval.small.queries

Official "small" version of the eval set, consisting of 6,837 queries (6.8% of the full eval set).

Dataset irds.msmarco-passage.eval.small.scoreddocs

Official "small" version of the eval set, consisting of 6,837 queries (6.8% of the full eval set).

Dataset irds.msmarco-passage.train.queries

Official train set.

Not all queries have relevance judgments. Use msmarco-passage/train/judged for a filtered list that only includes queries that have at least one qrel.

scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available train queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).

docpairs provides access to the "official" sequence for pairwise training.
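To show how such pairs are commonly consumed in pairwise training, the sketch below computes a margin (hinge) loss over triples. It assumes each pair is a (query, positive passage, negative passage) triple; the hinge loss is one common choice and is not claimed to be the loss used with the official sequence. All names and scores are hypothetical.

```python
from typing import Callable, List, Tuple

DocPair = Tuple[str, str, str]  # (query_id, positive_doc_id, negative_doc_id)


def hinge_loss(pos_score: float, neg_score: float, margin: float = 1.0) -> float:
    """Zero once the positive outscores the negative by at least the margin."""
    return max(0.0, margin - (pos_score - neg_score))


def mean_pairwise_loss(docpairs: List[DocPair],
                       score_fn: Callable[[str, str], float],
                       margin: float = 1.0) -> float:
    """Average hinge loss over all pairs, visited in the given order."""
    if not docpairs:
        return 0.0
    losses = [
        hinge_loss(score_fn(q, pos), score_fn(q, neg), margin)
        for q, pos, neg in docpairs
    ]
    return sum(losses) / len(losses)


# Toy pairs and a hypothetical scorer backed by a lookup table.
pairs = [("q1", "d1", "d2"), ("q2", "d3", "d4")]
toy_scores = {("q1", "d1"): 2.0, ("q1", "d2"): 0.5,
              ("q2", "d3"): 0.1, ("q2", "d4"): 0.4}
loss = mean_pairwise_loss(pairs, lambda q, d: toy_scores[(q, d)])
print(loss)
```

The first pair is already separated by more than the margin and contributes nothing; the second is ranked the wrong way round and accounts for all of the loss.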

Dataset irds.msmarco-passage.train.docpairs

Official train set.

Not all queries have relevance judgments. Use msmarco-passage/train/judged for a filtered list that only includes queries that have at least one qrel.

scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available train queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).

docpairs provides access to the "official" sequence for pairwise training.

Dataset irds.msmarco-passage.train.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official train set.

Not all queries have relevance judgments. Use msmarco-passage/train/judged for a filtered list that only includes queries that have at least one qrel.

scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available train queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).

docpairs provides access to the "official" sequence for pairwise training.

Dataset irds.msmarco-passage.train.qrels

Official train set.

Not all queries have relevance judgments. Use msmarco-passage/train/judged for a filtered list that only includes queries that have at least one qrel.

scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available train queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).

docpairs provides access to the "official" sequence for pairwise training.

Dataset irds.msmarco-passage.train

Official train set.

Not all queries have relevance judgments. Use msmarco-passage/train/judged for a filtered list that only includes queries that have at least one qrel.

scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available train queries by the MSMARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).

docpairs provides access to the "official" sequence for pairwise training.

Dataset irds.msmarco-passage.train.judged.queries

Subset of msmarco-passage/train that only includes queries that have at least one qrel.

Dataset irds.msmarco-passage.train.judged.docpairs

Subset of msmarco-passage/train that only includes queries that have at least one qrel.

Dataset irds.msmarco-passage.train.judged.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of msmarco-passage/train that only includes queries that have at least one qrel.

Dataset irds.msmarco-passage.train.judged.qrels

Subset of msmarco-passage/train that only includes queries that have at least one qrel.

Dataset irds.msmarco-passage.train.judged

Subset of msmarco-passage/train that only includes queries that have at least one qrel.

Dataset irds.msmarco-passage.train.medical.queries

Subset of msmarco-passage/train that only includes queries that have a layman or expert medical term. Note that this includes about 20% false matches due to terms with multiple senses.

Dataset irds.msmarco-passage.train.medical.docpairs

Subset of msmarco-passage/train that only includes queries that have a layman or expert medical term. Note that this includes about 20% false matches due to terms with multiple senses.

Dataset irds.msmarco-passage.train.medical.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of msmarco-passage/train that only includes queries that have a layman or expert medical term. Note that this includes about 20% false matches due to terms with multiple senses.

Dataset irds.msmarco-passage.train.medical.qrels

Subset of msmarco-passage/train that only includes queries that have a layman or expert medical term. Note that this includes about 20% false matches due to terms with multiple senses.

Dataset irds.msmarco-passage.train.medical

Subset of msmarco-passage/train that only includes queries that have a layman or expert medical term. Note that this includes about 20% false matches due to terms with multiple senses.

Dataset irds.msmarco-passage.train.split200-train.queries

Subset of msmarco-passage/train without 200 queries that are meant to be used as a small validation set. From various works.
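The split200-train and split200-valid subsets partition the training queries around a fixed set of 200 held-out IDs. A minimal sketch of that relationship follows; the held-out IDs and query texts are invented for illustration (the real 200 IDs are not listed here), and `split_by_heldout` is a hypothetical helper.

```python
from typing import Dict, Set, Tuple


def split_by_heldout(
    queries: Dict[str, str], valid_ids: Set[str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Partition queries into (train, valid) using a fixed set of held-out IDs."""
    train = {qid: text for qid, text in queries.items() if qid not in valid_ids}
    valid = {qid: text for qid, text in queries.items() if qid in valid_ids}
    return train, valid


# Ten made-up queries, two of which are held out for validation.
queries = {f"q{i}": f"query text {i}" for i in range(10)}
valid_ids = {"q2", "q7"}
train, valid = split_by_heldout(queries, valid_ids)
```

The two outputs are disjoint and together cover the original query set, which is the property that makes the pair usable as a train/validation split.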

Dataset irds.msmarco-passage.train.split200-train.docpairs

Subset of msmarco-passage/train without 200 queries that are meant to be used as a small validation set. From various works.

Dataset irds.msmarco-passage.train.split200-train.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of msmarco-passage/train without 200 queries that are meant to be used as a small validation set. From various works.

Dataset irds.msmarco-passage.train.split200-train.qrels

Subset of msmarco-passage/train without 200 queries that are meant to be used as a small validation set. From various works.

Dataset irds.msmarco-passage.train.split200-train

Subset of msmarco-passage/train without 200 queries that are meant to be used as a small validation set. From various works.

Dataset irds.msmarco-passage.train.split200-valid.queries

Subset of msmarco-passage/train with only 200 queries that are meant to be used as a small validation set. From various works.

Dataset irds.msmarco-passage.train.split200-valid.docpairs

Subset of msmarco-passage/train with only 200 queries that are meant to be used as a small validation set. From various works.

Dataset irds.msmarco-passage.train.split200-valid.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of msmarco-passage/train with only 200 queries that are meant to be used as a small validation set. From various works.

Dataset irds.msmarco-passage.train.split200-valid.qrels

Subset of msmarco-passage/train with only 200 queries that are meant to be used as a small validation set. From various works.

Dataset irds.msmarco-passage.train.split200-valid

Subset of msmarco-passage/train with only 200 queries that are meant to be used as a small validation set. From various works.

Dataset irds.msmarco-passage.train.triples-small.queries

Version of msmarco-passage/train, but with the "small" triples file (a 10% sample of the full file).

Note that to save on storage space (27GB), the contents of the file are mapped to their corresponding query and document IDs. This process takes a few minutes to run the first time the triples are requested.
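The mapping step described above can be pictured as an exact-match lookup from text to ID. The sketch below is an assumption about the general idea, not the actual implementation: it supposes the triples file holds raw (query, positive passage, negative passage) text, and all names and sample data are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

TextTriple = Tuple[str, str, str]  # (query text, positive passage, negative passage)
IdTriple = Tuple[str, str, str]    # (query_id, positive doc_id, negative doc_id)


def map_triples_to_ids(
    text_triples: Iterable[TextTriple],
    query_id_by_text: Dict[str, str],
    doc_id_by_text: Dict[str, str],
) -> Iterator[IdTriple]:
    """Replace each text field with its ID via exact-match lookup."""
    for query, positive, negative in text_triples:
        yield query_id_by_text[query], doc_id_by_text[positive], doc_id_by_text[negative]


# Made-up miniature collection, for illustration only.
queries = {"q1": "what is bm25"}
docs = {"d1": "BM25 is a ranking function.", "d2": "Paris is in France."}
query_id_by_text = {text: qid for qid, text in queries.items()}
doc_id_by_text = {text: did for did, text in docs.items()}

text_triples = [("what is bm25", "BM25 is a ranking function.", "Paris is in France.")]
id_triples = list(map_triples_to_ids(text_triples, query_id_by_text, doc_id_by_text))
```

Storing three short IDs per row instead of three text fields is what yields the storage saving; the one-time cost is building the lookup tables.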

Dataset irds.msmarco-passage.train.triples-small.docpairs

Version of msmarco-passage/train, but with the "small" triples file (a 10% sample of the full file).

Note that to save on storage space (27GB), the contents of the file are mapped to their corresponding query and document IDs. This process takes a few minutes to run the first time the triples are requested.

Dataset irds.msmarco-passage.train.triples-small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, but with the "small" triples file (a 10% sample of the full file).

Note that to save on storage space (27GB), the contents of the file are mapped to their corresponding query and document IDs. This process takes a few minutes to run the first time the triples are requested.

Dataset irds.msmarco-passage.train.triples-small.qrels

Version of msmarco-passage/train, but with the "small" triples file (a 10% sample of the full file).

Note that to save on storage space (27GB), the contents of the file are mapped to their corresponding query and document IDs. This process takes a few minutes to run the first time the triples are requested.

Dataset irds.msmarco-passage.train.triples-small

Version of msmarco-passage/train, but with the "small" triples file (a 10% sample of the full file).

Note that to save on storage space (27GB), the contents of the file are mapped to their corresponding query and document IDs. This process takes a few minutes to run the first time the triples are requested.

Dataset irds.msmarco-passage.train.triples-v2.queries

Version of msmarco-passage/train, but with version 2 of the triples file.

This version of the triples file includes rows that were accidentally missing from version 1 of the file (see discussion here).

Note that this is sorted by the IDs in the file, so you will probably want to shuffle it before use. We opened an issue suggesting that a third, shuffled version of the file be provided so that the order is consistent across groups using the data, but at this time no such file exists in an official capacity.
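Since the file is sorted by ID, one way to get an order that is both randomized and reproducible is to shuffle with a fixed seed. This is a minimal sketch: the triple layout and the seed value 42 are arbitrary assumptions, and the sample rows are made up.

```python
import random
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (query_id, positive doc_id, negative doc_id); assumed layout


def shuffled(triples: Sequence[Triple], seed: int = 0) -> List[Triple]:
    """Return a shuffled copy; the same seed always gives the same order."""
    rng = random.Random(seed)  # private generator, leaves the global RNG untouched
    result = list(triples)     # copy so the caller's sequence is not modified
    rng.shuffle(result)
    return result


# Made-up triples, sorted by ID like the file described above.
triples = [(f"q{i // 3}", f"pos{i}", f"neg{i}") for i in range(12)]
first = shuffled(triples, seed=42)
second = shuffled(triples, seed=42)
```

Sharing the seed (and the Python version) across groups gives each of them the same training sequence, which is the consistency the note asks for.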

Dataset irds.msmarco-passage.train.triples-v2.docpairs

Version of msmarco-passage/train, but with version 2 of the triples file.

This version of the triples file includes rows that were accidentally missing from version 1 of the file (see discussion here).

Note that this is sorted by the IDs in the file, so you will probably want to shuffle it before use. We opened an issue suggesting that a third, shuffled version of the file be provided so that the order is consistent across groups using the data, but at this time no such file exists in an official capacity.

Dataset irds.msmarco-passage.train.triples-v2.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, but with version 2 of the triples file.

This version of the triples file includes rows that were accidentally missing from version 1 of the file (see discussion here).

Note that this is sorted by the IDs in the file, so you will probably want to shuffle it before use. We opened an issue suggesting that a third, shuffled version of the file be provided so that the order is consistent across groups using the data, but at this time no such file exists in an official capacity.

Dataset irds.msmarco-passage.train.triples-v2.qrels

Version of msmarco-passage/train, but with version 2 of the triples file.

This version of the triples file includes rows that were accidentally missing from version 1 of the file (see discussion here).

Note that this is sorted by the IDs in the file, so you will probably want to shuffle it before use. We opened an issue suggesting that a third, shuffled version of the file be provided so that the order is consistent across groups using the data, but at this time no such file exists in an official capacity.

Dataset irds.msmarco-passage.train.triples-v2

Version of msmarco-passage/train, but with version 2 of the triples file.

This version of the triples file includes rows that were accidentally missing from version 1 of the file (see discussion here).

Note that this is sorted by the IDs in the file, so you will probably want to shuffle it before use. We opened an issue suggesting that a third, shuffled version of the file be provided so that the order is consistent across groups using the data, but at this time no such file exists in an official capacity.

Dataset irds.msmarco-passage.trec-dl-2019.queries

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-passage/eval. A subset of these queries were judged by NIST assessors (filtered list available in msmarco-passage/trec-dl-2019/judged).

Shared Task Paper

Dataset irds.msmarco-passage.trec-dl-2019.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-passage/eval. A subset of these queries were judged by NIST assessors (filtered list available in msmarco-passage/trec-dl-2019/judged).

Shared Task Paper

Dataset irds.msmarco-passage.trec-dl-2019.qrels

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-passage/eval. A subset of these queries were judged by NIST assessors (filtered list available in msmarco-passage/trec-dl-2019/judged).

Shared Task Paper

Dataset irds.msmarco-passage.trec-dl-2019

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-passage/eval. A subset of these queries were judged by NIST assessors (filtered list available in msmarco-passage/trec-dl-2019/judged).

Shared Task Paper

Dataset irds.msmarco-passage.trec-dl-2019.judged.queries

Subset of msmarco-passage/trec-dl-2019, only including queries with qrels.

Dataset irds.msmarco-passage.trec-dl-2019.judged.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of msmarco-passage/trec-dl-2019, only including queries with qrels.

Dataset irds.msmarco-passage.trec-dl-2019.judged.qrels

Subset of msmarco-passage/trec-dl-2019, only including queries with qrels.

Dataset irds.msmarco-passage.trec-dl-2019.judged

Subset of msmarco-passage/trec-dl-2019, only including queries with qrels.

Dataset irds.msmarco-passage.trec-dl-2020.queries

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-passage/eval. A subset of these queries were judged by NIST assessors (filtered list available in msmarco-passage/trec-dl-2020/judged).

Shared Task Paper

Dataset irds.msmarco-passage.trec-dl-2020.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-passage/eval. A subset of these queries were judged by NIST assessors (filtered list available in msmarco-passage/trec-dl-2020/judged).

Shared Task Paper

Dataset irds.msmarco-passage.trec-dl-2020.qrels

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-passage/eval. A subset of these queries were judged by NIST assessors (filtered list available in msmarco-passage/trec-dl-2020/judged).

Shared Task Paper

Dataset irds.msmarco-passage.trec-dl-2020

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-passage/eval. A subset of these queries were judged by NIST assessors (filtered list available in msmarco-passage/trec-dl-2020/judged).

Shared Task Paper

Dataset irds.msmarco-passage.trec-dl-2020.judged.queries

Subset of msmarco-passage/trec-dl-2020, only including queries with qrels.

Dataset irds.msmarco-passage.trec-dl-2020.judged.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of msmarco-passage/trec-dl-2020, only including queries with qrels.

Dataset irds.msmarco-passage.trec-dl-2020.judged.qrels

Subset of msmarco-passage/trec-dl-2020, only including queries with qrels.

Dataset irds.msmarco-passage.trec-dl-2020.judged

Subset of msmarco-passage/trec-dl-2020, only including queries with qrels.

Dataset irds.msmarco-passage.trec-dl-hard.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.

data website
See Also: msmarco-document/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.qrels

A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.

data website
See Also: msmarco-document/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard

A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.

data website
See Also: msmarco-document/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold1.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 1 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold1.qrels

Fold 1 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold1

Fold 1 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold2.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 2 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold2.qrels

Fold 2 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold2

Fold 2 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold3.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 3 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold3.qrels

Fold 3 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold3

Fold 3 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold4.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 4 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold4.qrels

Fold 4 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold4

Fold 4 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold5.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 5 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold5.qrels

Fold 5 of msmarco-passage/trec-dl-hard

Dataset irds.msmarco-passage.trec-dl-hard.fold5

Fold 5 of msmarco-passage/trec-dl-hard
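One plausible use of the five folds is cross-validation, where each fold serves once as the test set while the other four form the training set. That usage is an assumption, not something stated in this listing; the sketch below models each fold as a list of made-up query IDs, and `cross_validation_splits` is a hypothetical helper.

```python
from typing import Dict, Iterator, List, Tuple


def cross_validation_splits(
    folds: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_fold_name, train_query_ids, test_query_ids) per fold."""
    names = sorted(folds)
    for held_out in names:
        test = list(folds[held_out])
        # Training IDs are everything outside the held-out fold.
        train = [qid for name in names if name != held_out for qid in folds[name]]
        yield held_out, train, test


# Made-up fold contents: two query IDs per fold.
folds = {f"fold{i}": [f"q{i}a", f"q{i}b"] for i in range(1, 6)}
splits = list(cross_validation_splits(folds))
```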

mmarco/de

Version of msmarco-passage, with documents translated into German.
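The identifiers in this listing appear to follow a simple pattern: the slash-separated path (for example mmarco/de/dev/small) with slashes replaced by dots, an irds. prefix, and an optional component suffix such as queries or qrels. The helper below is a hypothetical illustration of that observed pattern, not a function provided by datamaestro_text.

```python
from typing import Optional


def to_datamaestro_id(irds_path: str, part: Optional[str] = None) -> str:
    """Build a dotted identifier from a slash-separated dataset path.

    `part` is an optional component name such as "queries" or "qrels".
    """
    base = "irds." + irds_path.strip("/").replace("/", ".")
    return f"{base}.{part}" if part else base


dev_small_queries = to_datamaestro_id("mmarco/de/dev/small", "queries")
train_judged = to_datamaestro_id("msmarco-passage/train/judged")
```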

Dataset irds.mmarco.de.documents

Version of msmarco-passage, with documents translated into German.

Dataset irds.mmarco.de.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into German.

Dataset irds.mmarco.de.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into German.

Dataset irds.mmarco.de.dev

Version of msmarco-passage/dev, with queries and documents translated into German.

Dataset irds.mmarco.de.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into German.

Dataset irds.mmarco.de.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into German.

Dataset irds.mmarco.de.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into German.

Dataset irds.mmarco.de.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into German.

Dataset irds.mmarco.de.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into German.

Dataset irds.mmarco.de.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into German.

Dataset irds.mmarco.de.train.qrels

Version of msmarco-passage/train, with queries and documents translated into German.

Dataset irds.mmarco.de.train

Version of msmarco-passage/train, with queries and documents translated into German.

mmarco/es

Version of msmarco-passage, with documents translated into Spanish.

Dataset irds.mmarco.es.documents

Version of msmarco-passage, with documents translated into Spanish.

Dataset irds.mmarco.es.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Spanish.

Dataset irds.mmarco.es.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Spanish.

Dataset irds.mmarco.es.dev

Version of msmarco-passage/dev, with queries and documents translated into Spanish.

Dataset irds.mmarco.es.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Spanish.

Dataset irds.mmarco.es.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Spanish.

Dataset irds.mmarco.es.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Spanish.

Dataset irds.mmarco.es.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Spanish.

Dataset irds.mmarco.es.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Spanish.

Dataset irds.mmarco.es.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Spanish.

Dataset irds.mmarco.es.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Spanish.

Dataset irds.mmarco.es.train

Version of msmarco-passage/train, with queries and documents translated into Spanish.

mmarco/fr

Version of msmarco-passage, with documents translated into French.

Dataset irds.mmarco.fr.documents

Version of msmarco-passage, with documents translated into French.

Dataset irds.mmarco.fr.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into French.

Dataset irds.mmarco.fr.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into French.

Dataset irds.mmarco.fr.dev

Version of msmarco-passage/dev, with queries and documents translated into French.

Dataset irds.mmarco.fr.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into French.

Dataset irds.mmarco.fr.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into French.

Dataset irds.mmarco.fr.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into French.

Dataset irds.mmarco.fr.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into French.

Dataset irds.mmarco.fr.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into French.

Dataset irds.mmarco.fr.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into French.

Dataset irds.mmarco.fr.train.qrels

Version of msmarco-passage/train, with queries and documents translated into French.

Dataset irds.mmarco.fr.train

Version of msmarco-passage/train, with queries and documents translated into French.

mmarco/id

Version of msmarco-passage, with documents translated into Indonesian.

Dataset irds.mmarco.id.documents

Version of msmarco-passage, with documents translated into Indonesian.

Dataset irds.mmarco.id.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Indonesian.

Dataset irds.mmarco.id.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Indonesian.

Dataset irds.mmarco.id.dev

Version of msmarco-passage/dev, with queries and documents translated into Indonesian.

Dataset irds.mmarco.id.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Indonesian.

Dataset irds.mmarco.id.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Indonesian.

Dataset irds.mmarco.id.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Indonesian.

Dataset irds.mmarco.id.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Indonesian.

Dataset irds.mmarco.id.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Indonesian.

Dataset irds.mmarco.id.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Indonesian.

Dataset irds.mmarco.id.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Indonesian.

Dataset irds.mmarco.id.train

Version of msmarco-passage/train, with queries and documents translated into Indonesian.

mmarco/it

Version of msmarco-passage, with documents translated into Italian.

Dataset irds.mmarco.it.documents

Version of msmarco-passage, with documents translated into Italian.

Dataset irds.mmarco.it.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Italian.

Dataset irds.mmarco.it.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Italian.

Dataset irds.mmarco.it.dev

Version of msmarco-passage/dev, with queries and documents translated into Italian.

Dataset irds.mmarco.it.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Italian.

Dataset irds.mmarco.it.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Italian.

Dataset irds.mmarco.it.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Italian.

Dataset irds.mmarco.it.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Italian.

Dataset irds.mmarco.it.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Italian.

Dataset irds.mmarco.it.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Italian.

Dataset irds.mmarco.it.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Italian.

Dataset irds.mmarco.it.train

Version of msmarco-passage/train, with queries and documents translated into Italian.

mmarco/pt

Version of msmarco-passage, with documents translated into Portuguese.

Dataset irds.mmarco.pt.documents

Version of msmarco-passage, with documents translated into Portuguese.

Dataset irds.mmarco.pt.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Portuguese.

Dataset irds.mmarco.pt.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Portuguese.

Dataset irds.mmarco.pt.dev

Version of msmarco-passage/dev, with queries and documents translated into Portuguese.

Dataset irds.mmarco.pt.dev.small.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Portuguese.

Dataset irds.mmarco.pt.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Portuguese.

Dataset irds.mmarco.pt.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Portuguese.

Dataset irds.mmarco.pt.dev.small.v1.1.queries

Version of msmarco-passage/dev, with queries and documents translated into Portuguese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here. It also removes some duplicated query IDs.
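Removing duplicated query IDs can be sketched as a single pass that keeps one record per ID. How version 1.1 actually resolves duplicates is not stated here; keeping the first occurrence is an assumption made for illustration, and the sample queries are invented.

```python
from typing import Iterable, List, Set, Tuple


def drop_duplicate_ids(queries: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first record seen for each query ID, preserving input order."""
    seen: Set[str] = set()
    unique: List[Tuple[str, str]] = []
    for query_id, text in queries:
        if query_id in seen:
            continue  # a later record with an already-seen ID is dropped
        seen.add(query_id)
        unique.append((query_id, text))
    return unique


# Made-up sample with query ID "10" appearing twice.
raw = [("10", "o que é bm25"), ("11", "capital da frança"), ("10", "o que e bm25")]
deduplicated = drop_duplicate_ids(raw)
```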

Dataset irds.mmarco.pt.dev.small.v1.1.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Portuguese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here. It also removes some duplicated query IDs.

Dataset irds.mmarco.pt.dev.small.v1.1.qrels

Version of msmarco-passage/dev, with queries and documents translated into Portuguese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here. It also removes some duplicated query IDs.

Dataset irds.mmarco.pt.dev.small.v1.1

Version of msmarco-passage/dev, with queries and documents translated into Portuguese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here. It also removes some duplicated query IDs.

Dataset irds.mmarco.pt.dev.v1.1.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Portuguese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here. It also removes some duplicated query IDs.

Dataset irds.mmarco.pt.dev.v1.1.qrels

Version of msmarco-passage/dev, with queries and documents translated into Portuguese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here. It also removes some duplicated query IDs.

Dataset irds.mmarco.pt.dev.v1.1

Version of msmarco-passage/dev, with queries and documents translated into Portuguese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here. It also removes some duplicated query IDs.

Dataset irds.mmarco.pt.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Portuguese.

Dataset irds.mmarco.pt.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Portuguese.

Dataset irds.mmarco.pt.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Portuguese.

Dataset irds.mmarco.pt.train

Version of msmarco-passage/train, with queries and documents translated into Portuguese.

Dataset irds.mmarco.pt.train.v1.1.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Portuguese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here. It also removes some duplicated query IDs.

Dataset irds.mmarco.pt.train.v1.1.docpairs

Version of msmarco-passage/train, with queries and documents translated into Portuguese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here. It also removes some duplicated query IDs.

Dataset irds.mmarco.pt.train.v1.1.qrels

Version of msmarco-passage/train, with queries and documents translated into Portuguese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here. It also removes some duplicated query IDs.

Dataset irds.mmarco.pt.train.v1.1

Version of msmarco-passage/train, with queries and documents translated into Portuguese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here. It also removes some duplicated query IDs.

mmarco/ru

Version of msmarco-passage, with documents translated into Russian.

Dataset irds.mmarco.ru.documents

Version of msmarco-passage, with documents translated into Russian.

Dataset irds.mmarco.ru.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Russian.

Dataset irds.mmarco.ru.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Russian.

Dataset irds.mmarco.ru.dev

Version of msmarco-passage/dev, with queries and documents translated into Russian.

Dataset irds.mmarco.ru.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Russian.

Dataset irds.mmarco.ru.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Russian.

Dataset irds.mmarco.ru.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Russian.

Dataset irds.mmarco.ru.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Russian.

Dataset irds.mmarco.ru.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Russian.

Dataset irds.mmarco.ru.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Russian.

Dataset irds.mmarco.ru.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Russian.

Dataset irds.mmarco.ru.train

Version of msmarco-passage/train, with queries and documents translated into Russian.

mmarco/v2/ar

Version of msmarco-passage, with queries and documents translated into Arabic.

Dataset irds.mmarco.v2.ar.documents

Version of msmarco-passage, with queries and documents translated into Arabic.

Dataset irds.mmarco.v2.ar.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Arabic.

Dataset irds.mmarco.v2.ar.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Arabic.

Dataset irds.mmarco.v2.ar.dev

Version of msmarco-passage/dev, with queries and documents translated into Arabic.

Dataset irds.mmarco.v2.ar.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Arabic.

Dataset irds.mmarco.v2.ar.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Arabic.

Dataset irds.mmarco.v2.ar.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Arabic.

Dataset irds.mmarco.v2.ar.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Arabic.

Dataset irds.mmarco.v2.ar.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Arabic.

Dataset irds.mmarco.v2.ar.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Arabic.

Dataset irds.mmarco.v2.ar.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Arabic.

Dataset irds.mmarco.v2.ar.train

Version of msmarco-passage/train, with queries and documents translated into Arabic.
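
The dataset identifiers listed on this page appear to be derived from the ir_datasets path by replacing each `/` with `.` and prefixing `irds.` (for example, `mmarco/v2/ar` becomes `irds.mmarco.v2.ar`). The sketch below illustrates that naming pattern only. The helper name `to_datamaestro_id` is hypothetical and is not part of datamaestro or ir_datasets, and the rule is inferred from the names shown here rather than from the library's source.

```python
def to_datamaestro_id(irds_path: str, part: str = "") -> str:
    """Build a dotted identifier from an ir_datasets-style path.

    Assumption: the rule is inferred from the names on this page
    (e.g. "mmarco/v2/ar" -> "irds.mmarco.v2.ar"); it is not an official API.
    `part` is an optional suffix such as "documents", "queries" or "qrels".
    """
    dotted = "irds." + irds_path.strip("/").replace("/", ".")
    return f"{dotted}.{part}" if part else dotted


if __name__ == "__main__":
    print(to_datamaestro_id("mmarco/v2/ar", "documents"))
    print(to_datamaestro_id("mmarco/v2/ar/dev/small", "qrels"))
```

The resulting strings match entries such as `irds.mmarco.v2.ar.documents` and `irds.mmarco.v2.ar.dev.small.qrels` listed above.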

mmarco/v2/de

Version of msmarco-passage, with queries and documents translated into German.

Dataset irds.mmarco.v2.de.documents

Version of msmarco-passage, with queries and documents translated into German.

Dataset irds.mmarco.v2.de.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into German.

Dataset irds.mmarco.v2.de.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into German.

Dataset irds.mmarco.v2.de.dev

Version of msmarco-passage/dev, with queries and documents translated into German.

Dataset irds.mmarco.v2.de.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into German.

Dataset irds.mmarco.v2.de.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into German.

Dataset irds.mmarco.v2.de.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into German.

Dataset irds.mmarco.v2.de.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into German.

Dataset irds.mmarco.v2.de.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into German.

Dataset irds.mmarco.v2.de.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into German.

Dataset irds.mmarco.v2.de.train.qrels

Version of msmarco-passage/train, with queries and documents translated into German.

Dataset irds.mmarco.v2.de.train

Version of msmarco-passage/train, with queries and documents translated into German.

mmarco/v2/dt

Version of msmarco-passage, with queries and documents translated into Dutch.

Dataset irds.mmarco.v2.dt.documents

Version of msmarco-passage, with queries and documents translated into Dutch.

Dataset irds.mmarco.v2.dt.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Dutch.

Dataset irds.mmarco.v2.dt.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Dutch.

Dataset irds.mmarco.v2.dt.dev

Version of msmarco-passage/dev, with queries and documents translated into Dutch.

Dataset irds.mmarco.v2.dt.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Dutch.

Dataset irds.mmarco.v2.dt.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Dutch.

Dataset irds.mmarco.v2.dt.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Dutch.

Dataset irds.mmarco.v2.dt.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Dutch.

Dataset irds.mmarco.v2.dt.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Dutch.

Dataset irds.mmarco.v2.dt.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Dutch.

Dataset irds.mmarco.v2.dt.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Dutch.

Dataset irds.mmarco.v2.dt.train

Version of msmarco-passage/train, with queries and documents translated into Dutch.

mmarco/v2/es

Version of msmarco-passage, with queries and documents translated into Spanish.

Dataset irds.mmarco.v2.es.documents

Version of msmarco-passage, with queries and documents translated into Spanish.

Dataset irds.mmarco.v2.es.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Spanish.

Dataset irds.mmarco.v2.es.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Spanish.

Dataset irds.mmarco.v2.es.dev

Version of msmarco-passage/dev, with queries and documents translated into Spanish.

Dataset irds.mmarco.v2.es.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Spanish.

Dataset irds.mmarco.v2.es.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Spanish.

Dataset irds.mmarco.v2.es.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Spanish.

Dataset irds.mmarco.v2.es.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Spanish.

Dataset irds.mmarco.v2.es.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Spanish.

Dataset irds.mmarco.v2.es.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Spanish.

Dataset irds.mmarco.v2.es.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Spanish.

Dataset irds.mmarco.v2.es.train

Version of msmarco-passage/train, with queries and documents translated into Spanish.

mmarco/v2/fr

Version of msmarco-passage, with queries and documents translated into French.

Dataset irds.mmarco.v2.fr.documents

Version of msmarco-passage, with queries and documents translated into French.

Dataset irds.mmarco.v2.fr.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into French.

Dataset irds.mmarco.v2.fr.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into French.

Dataset irds.mmarco.v2.fr.dev

Version of msmarco-passage/dev, with queries and documents translated into French.

Dataset irds.mmarco.v2.fr.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into French.

Dataset irds.mmarco.v2.fr.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into French.

Dataset irds.mmarco.v2.fr.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into French.

Dataset irds.mmarco.v2.fr.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into French.

Dataset irds.mmarco.v2.fr.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into French.

Dataset irds.mmarco.v2.fr.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into French.

Dataset irds.mmarco.v2.fr.train.qrels

Version of msmarco-passage/train, with queries and documents translated into French.

Dataset irds.mmarco.v2.fr.train

Version of msmarco-passage/train, with queries and documents translated into French.

mmarco/v2/hi

Version of msmarco-passage, with queries and documents translated into Hindi.

Dataset irds.mmarco.v2.hi.documents

Version of msmarco-passage, with queries and documents translated into Hindi.

Dataset irds.mmarco.v2.hi.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Hindi.

Dataset irds.mmarco.v2.hi.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Hindi.

Dataset irds.mmarco.v2.hi.dev

Version of msmarco-passage/dev, with queries and documents translated into Hindi.

Dataset irds.mmarco.v2.hi.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Hindi.

Dataset irds.mmarco.v2.hi.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Hindi.

Dataset irds.mmarco.v2.hi.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Hindi.

Dataset irds.mmarco.v2.hi.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Hindi.

Dataset irds.mmarco.v2.hi.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Hindi.

Dataset irds.mmarco.v2.hi.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Hindi.

Dataset irds.mmarco.v2.hi.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Hindi.

Dataset irds.mmarco.v2.hi.train

Version of msmarco-passage/train, with queries and documents translated into Hindi.

mmarco/v2/id

Version of msmarco-passage, with queries and documents translated into Indonesian.

Dataset irds.mmarco.v2.id.documents

Version of msmarco-passage, with queries and documents translated into Indonesian.

Dataset irds.mmarco.v2.id.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Indonesian.

Dataset irds.mmarco.v2.id.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Indonesian.

Dataset irds.mmarco.v2.id.dev

Version of msmarco-passage/dev, with queries and documents translated into Indonesian.

Dataset irds.mmarco.v2.id.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Indonesian.

Dataset irds.mmarco.v2.id.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Indonesian.

Dataset irds.mmarco.v2.id.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Indonesian.

Dataset irds.mmarco.v2.id.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Indonesian.

Dataset irds.mmarco.v2.id.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Indonesian.

Dataset irds.mmarco.v2.id.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Indonesian.

Dataset irds.mmarco.v2.id.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Indonesian.

Dataset irds.mmarco.v2.id.train

Version of msmarco-passage/train, with queries and documents translated into Indonesian.

mmarco/v2/it

Version of msmarco-passage, with queries and documents translated into Italian.

Dataset irds.mmarco.v2.it.documents

Version of msmarco-passage, with queries and documents translated into Italian.

Dataset irds.mmarco.v2.it.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Italian.

Dataset irds.mmarco.v2.it.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Italian.

Dataset irds.mmarco.v2.it.dev

Version of msmarco-passage/dev, with queries and documents translated into Italian.

Dataset irds.mmarco.v2.it.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Italian.

Dataset irds.mmarco.v2.it.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Italian.

Dataset irds.mmarco.v2.it.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Italian.

Dataset irds.mmarco.v2.it.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Italian.

Dataset irds.mmarco.v2.it.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Italian.

Dataset irds.mmarco.v2.it.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Italian.

Dataset irds.mmarco.v2.it.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Italian.

Dataset irds.mmarco.v2.it.train

Version of msmarco-passage/train, with queries and documents translated into Italian.

mmarco/v2/ja

Version of msmarco-passage, with queries and documents translated into Japanese.

Dataset irds.mmarco.v2.ja.documents

Version of msmarco-passage, with queries and documents translated into Japanese.

Dataset irds.mmarco.v2.ja.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Japanese.

Dataset irds.mmarco.v2.ja.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Japanese.

Dataset irds.mmarco.v2.ja.dev

Version of msmarco-passage/dev, with queries and documents translated into Japanese.

Dataset irds.mmarco.v2.ja.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Japanese.

Dataset irds.mmarco.v2.ja.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Japanese.

Dataset irds.mmarco.v2.ja.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Japanese.

Dataset irds.mmarco.v2.ja.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Japanese.

Dataset irds.mmarco.v2.ja.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Japanese.

Dataset irds.mmarco.v2.ja.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Japanese.

Dataset irds.mmarco.v2.ja.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Japanese.

Dataset irds.mmarco.v2.ja.train

Version of msmarco-passage/train, with queries and documents translated into Japanese.

mmarco/v2/pt

Version of msmarco-passage, with queries and documents translated into Portuguese.

Dataset irds.mmarco.v2.pt.documents

Version of msmarco-passage, with queries and documents translated into Portuguese.

Dataset irds.mmarco.v2.pt.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Portuguese.

Dataset irds.mmarco.v2.pt.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Portuguese.

Dataset irds.mmarco.v2.pt.dev

Version of msmarco-passage/dev, with queries and documents translated into Portuguese.

Dataset irds.mmarco.v2.pt.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Portuguese.

Dataset irds.mmarco.v2.pt.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Portuguese.

Dataset irds.mmarco.v2.pt.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Portuguese.

Dataset irds.mmarco.v2.pt.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Portuguese.

Dataset irds.mmarco.v2.pt.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Portuguese.

Dataset irds.mmarco.v2.pt.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Portuguese.

Dataset irds.mmarco.v2.pt.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Portuguese.

Dataset irds.mmarco.v2.pt.train

Version of msmarco-passage/train, with queries and documents translated into Portuguese.

mmarco/v2/ru

Version of msmarco-passage, with queries and documents translated into Russian.

Dataset irds.mmarco.v2.ru.documents

Version of msmarco-passage, with queries and documents translated into Russian.

Dataset irds.mmarco.v2.ru.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Russian.

Dataset irds.mmarco.v2.ru.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Russian.

Dataset irds.mmarco.v2.ru.dev

Version of msmarco-passage/dev, with queries and documents translated into Russian.

Dataset irds.mmarco.v2.ru.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Russian.

Dataset irds.mmarco.v2.ru.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Russian.

Dataset irds.mmarco.v2.ru.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Russian.

Dataset irds.mmarco.v2.ru.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Russian.

Dataset irds.mmarco.v2.ru.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Russian.

Dataset irds.mmarco.v2.ru.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Russian.

Dataset irds.mmarco.v2.ru.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Russian.

Dataset irds.mmarco.v2.ru.train

Version of msmarco-passage/train, with queries and documents translated into Russian.

mmarco/v2/vi

Version of msmarco-passage, with queries and documents translated into Vietnamese.

Dataset irds.mmarco.v2.vi.documents

Version of msmarco-passage, with queries and documents translated into Vietnamese.

Dataset irds.mmarco.v2.vi.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Vietnamese.

Dataset irds.mmarco.v2.vi.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Vietnamese.

Dataset irds.mmarco.v2.vi.dev

Version of msmarco-passage/dev, with queries and documents translated into Vietnamese.

Dataset irds.mmarco.v2.vi.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Vietnamese.

Dataset irds.mmarco.v2.vi.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Vietnamese.

Dataset irds.mmarco.v2.vi.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Vietnamese.

Dataset irds.mmarco.v2.vi.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Vietnamese.

Dataset irds.mmarco.v2.vi.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Vietnamese.

Dataset irds.mmarco.v2.vi.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Vietnamese.

Dataset irds.mmarco.v2.vi.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Vietnamese.

Dataset irds.mmarco.v2.vi.train

Version of msmarco-passage/train, with queries and documents translated into Vietnamese.

mmarco/v2/zh

Version of msmarco-passage, with queries and documents translated into Chinese.

Dataset irds.mmarco.v2.zh.documents

Version of msmarco-passage, with queries and documents translated into Chinese.

Dataset irds.mmarco.v2.zh.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Chinese.

Dataset irds.mmarco.v2.zh.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Chinese.

Dataset irds.mmarco.v2.zh.dev

Version of msmarco-passage/dev, with queries and documents translated into Chinese.

Dataset irds.mmarco.v2.zh.dev.small.queries

Version of msmarco-passage/dev/small, with queries and documents translated into Chinese.

Dataset irds.mmarco.v2.zh.dev.small.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Chinese.

Dataset irds.mmarco.v2.zh.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Chinese.

Dataset irds.mmarco.v2.zh.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Chinese.

Dataset irds.mmarco.v2.zh.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Chinese.

Dataset irds.mmarco.v2.zh.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Chinese.

Dataset irds.mmarco.v2.zh.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Chinese.

Dataset irds.mmarco.v2.zh.train

Version of msmarco-passage/train, with queries and documents translated into Chinese.

mmarco/zh

Version of msmarco-passage, with documents translated into Chinese.

Dataset irds.mmarco.zh.documents

Version of msmarco-passage, with documents translated into Chinese.

Dataset irds.mmarco.zh.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Chinese.

Dataset irds.mmarco.zh.dev.qrels

Version of msmarco-passage/dev, with queries and documents translated into Chinese.

Dataset irds.mmarco.zh.dev

Version of msmarco-passage/dev, with queries and documents translated into Chinese.

Dataset irds.mmarco.zh.dev.small.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev/small, with queries and documents translated into Chinese.

Dataset irds.mmarco.zh.dev.small.qrels

Version of msmarco-passage/dev/small, with queries and documents translated into Chinese.

Dataset irds.mmarco.zh.dev.small

Version of msmarco-passage/dev/small, with queries and documents translated into Chinese.

Dataset irds.mmarco.zh.dev.small.v1.1.queries

Version of msmarco-passage/dev, with queries and documents translated into Chinese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here.

Dataset irds.mmarco.zh.dev.small.v1.1.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Chinese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here.

Dataset irds.mmarco.zh.dev.small.v1.1.qrels

Version of msmarco-passage/dev, with queries and documents translated into Chinese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here.

Dataset irds.mmarco.zh.dev.small.v1.1

Version of msmarco-passage/dev, with queries and documents translated into Chinese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here.

Dataset irds.mmarco.zh.dev.v1.1.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/dev, with queries and documents translated into Chinese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here.

Dataset irds.mmarco.zh.dev.v1.1.qrels

Version of msmarco-passage/dev, with queries and documents translated into Chinese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here.

Dataset irds.mmarco.zh.dev.v1.1

Version of msmarco-passage/dev, with queries and documents translated into Chinese.

Version 1.1 of this file includes manual corrections from the authors of the translated files. See discussion here.

Dataset irds.mmarco.zh.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Version of msmarco-passage/train, with queries and documents translated into Chinese.

Dataset irds.mmarco.zh.train.docpairs

Version of msmarco-passage/train, with queries and documents translated into Chinese.

Dataset irds.mmarco.zh.train.qrels

Version of msmarco-passage/train, with queries and documents translated into Chinese.

Dataset irds.mmarco.zh.train

Version of msmarco-passage/train, with queries and documents translated into Chinese.

mr-tydi/ar

Complete Arabic dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ar.documents

Complete Arabic dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ar.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Complete Arabic dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ar.qrels

Complete Arabic dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ar

Complete Arabic dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ar.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development set for Arabic

Dataset irds.mr-tydi.ar.dev.qrels

Development set for Arabic

Dataset irds.mr-tydi.ar.dev

Development set for Arabic

Dataset irds.mr-tydi.ar.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set for Arabic

Dataset irds.mr-tydi.ar.test.qrels

Test set for Arabic

Dataset irds.mr-tydi.ar.test

Test set for Arabic

Dataset irds.mr-tydi.ar.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train set for Arabic

Dataset irds.mr-tydi.ar.train.qrels

Train set for Arabic

Dataset irds.mr-tydi.ar.train

Train set for Arabic
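
Each mr-tydi language on this page exposes the same set of entries: the complete collection (documents, queries, qrels, and the combined dataset), followed by the dev, test, and train subsets. As a rough sketch, the identifiers for one language can be enumerated as below. The function name `mr_tydi_ids` is hypothetical, and the layout is assumed from the listing here rather than taken from the datamaestro source.

```python
def mr_tydi_ids(lang: str) -> list[str]:
    """List the mr-tydi identifiers for one language code, in page order.

    Assumption: every language follows the layout shown for "ar" on this page.
    """
    base = f"irds.mr-tydi.{lang}"
    # Complete collection: documents, all queries, all qrels, combined dataset.
    ids = [f"{base}.documents", f"{base}.queries", f"{base}.qrels", base]
    # Per-split entries: queries, qrels, then the combined split dataset.
    for split in ("dev", "test", "train"):
        ids += [f"{base}.{split}.queries", f"{base}.{split}.qrels", f"{base}.{split}"]
    return ids


if __name__ == "__main__":
    for dataset_id in mr_tydi_ids("ar"):
        print(dataset_id)
```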

mr-tydi/bn

Complete Bengali dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.bn.documents

Complete Bengali dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.bn.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Complete Bengali dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.bn.qrels

Complete Bengali dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.bn

Complete Bengali dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.bn.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development set for Bengali

Dataset irds.mr-tydi.bn.dev.qrels

Development set for Bengali

Dataset irds.mr-tydi.bn.dev

Development set for Bengali

Dataset irds.mr-tydi.bn.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set for Bengali

Dataset irds.mr-tydi.bn.test.qrels

Test set for Bengali

Dataset irds.mr-tydi.bn.test

Test set for Bengali

Dataset irds.mr-tydi.bn.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train set for Bengali

Dataset irds.mr-tydi.bn.train.qrels

Train set for Bengali

Dataset irds.mr-tydi.bn.train

Train set for Bengali

mr-tydi/en

Complete English dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.en.documents

Complete English dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.en.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Complete English dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.en.qrels

Complete English dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.en

Complete English dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.en.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development set for English

Dataset irds.mr-tydi.en.dev.qrels

Development set for English

Dataset irds.mr-tydi.en.dev

Development set for English

Dataset irds.mr-tydi.en.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set for English

Dataset irds.mr-tydi.en.test.qrels

Test set for English

Dataset irds.mr-tydi.en.test

Test set for English

Dataset irds.mr-tydi.en.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train set for English

Dataset irds.mr-tydi.en.train.qrels

Train set for English

Dataset irds.mr-tydi.en.train

Train set for English

mr-tydi/fi

Complete Finnish dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.fi.documents

Complete Finnish dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.fi.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Complete Finnish dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.fi.qrels

Complete Finnish dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.fi

Complete Finnish dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.fi.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development set for Finnish

Dataset irds.mr-tydi.fi.dev.qrels

Development set for Finnish

Dataset irds.mr-tydi.fi.dev

Development set for Finnish

Dataset irds.mr-tydi.fi.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set for Finnish

Dataset irds.mr-tydi.fi.test.qrels

Test set for Finnish

Dataset irds.mr-tydi.fi.test

Test set for Finnish

Dataset irds.mr-tydi.fi.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train set for Finnish

Dataset irds.mr-tydi.fi.train.qrels

Train set for Finnish

Dataset irds.mr-tydi.fi.train

Train set for Finnish

mr-tydi/id

Complete Indonesian dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.id.documents

Complete Indonesian dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.id.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Complete Indonesian dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.id.qrels

Complete Indonesian dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.id

Complete Indonesian dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.id.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development set for Indonesian

Dataset irds.mr-tydi.id.dev.qrels

Development set for Indonesian

Dataset irds.mr-tydi.id.dev

Development set for Indonesian

Dataset irds.mr-tydi.id.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set for Indonesian

Dataset irds.mr-tydi.id.test.qrels

Test set for Indonesian

Dataset irds.mr-tydi.id.test

Test set for Indonesian

Dataset irds.mr-tydi.id.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train set for Indonesian

Dataset irds.mr-tydi.id.train.qrels

Train set for Indonesian

Dataset irds.mr-tydi.id.train

Train set for Indonesian

mr-tydi/ja

Complete Japanese dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ja.documents

Complete Japanese dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ja.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Complete Japanese dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ja.qrels

Complete Japanese dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ja

Complete Japanese dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ja.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development set for Japanese

Dataset irds.mr-tydi.ja.dev.qrels

Development set for Japanese

Dataset irds.mr-tydi.ja.dev

Development set for Japanese

Dataset irds.mr-tydi.ja.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set for Japanese

Dataset irds.mr-tydi.ja.test.qrels

Test set for Japanese

Dataset irds.mr-tydi.ja.test

Test set for Japanese

Dataset irds.mr-tydi.ja.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train set for Japanese

Dataset irds.mr-tydi.ja.train.qrels

Train set for Japanese

Dataset irds.mr-tydi.ja.train

Train set for Japanese

mr-tydi/ko

Complete Korean dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ko.documents

Complete Korean dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ko.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Complete Korean dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ko.qrels

Complete Korean dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ko

Complete Korean dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ko.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development set for Korean

Dataset irds.mr-tydi.ko.dev.qrels

Development set for Korean

Dataset irds.mr-tydi.ko.dev

Development set for Korean

Dataset irds.mr-tydi.ko.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set for Korean

Dataset irds.mr-tydi.ko.test.qrels

Test set for Korean

Dataset irds.mr-tydi.ko.test

Test set for Korean

Dataset irds.mr-tydi.ko.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train set for Korean

Dataset irds.mr-tydi.ko.train.qrels

Train set for Korean

Dataset irds.mr-tydi.ko.train

Train set for Korean

mr-tydi/ru

Complete Russian dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ru.documents

Complete Russian dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ru.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Complete Russian dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ru.qrels

Complete Russian dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ru

Complete Russian dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.ru.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development set for Russian

Dataset irds.mr-tydi.ru.dev.qrels

Development set for Russian

Dataset irds.mr-tydi.ru.dev

Development set for Russian

Dataset irds.mr-tydi.ru.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set for Russian

Dataset irds.mr-tydi.ru.test.qrels

Test set for Russian

Dataset irds.mr-tydi.ru.test

Test set for Russian

Dataset irds.mr-tydi.ru.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train set for Russian

Dataset irds.mr-tydi.ru.train.qrels

Train set for Russian

Dataset irds.mr-tydi.ru.train

Train set for Russian

mr-tydi/sw

Complete Swahili dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.sw.documents

Complete Swahili dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.sw.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Complete Swahili dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.sw.qrels

Complete Swahili dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.sw

Complete Swahili dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.sw.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development set for Swahili

Dataset irds.mr-tydi.sw.dev.qrels

Development set for Swahili

Dataset irds.mr-tydi.sw.dev

Development set for Swahili

Dataset irds.mr-tydi.sw.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set for Swahili

Dataset irds.mr-tydi.sw.test.qrels

Test set for Swahili

Dataset irds.mr-tydi.sw.test

Test set for Swahili

Dataset irds.mr-tydi.sw.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train set for Swahili

Dataset irds.mr-tydi.sw.train.qrels

Train set for Swahili

Dataset irds.mr-tydi.sw.train

Train set for Swahili

mr-tydi/te

Complete Telugu dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.te.documents

Complete Telugu dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.te.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Complete Telugu dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.te.qrels

Complete Telugu dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.te

Complete Telugu dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.te.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development set for Telugu

Dataset irds.mr-tydi.te.dev.qrels

Development set for Telugu

Dataset irds.mr-tydi.te.dev

Development set for Telugu

Dataset irds.mr-tydi.te.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set for Telugu

Dataset irds.mr-tydi.te.test.qrels

Test set for Telugu

Dataset irds.mr-tydi.te.test

Test set for Telugu

Dataset irds.mr-tydi.te.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train set for Telugu

Dataset irds.mr-tydi.te.train.qrels

Train set for Telugu

Dataset irds.mr-tydi.te.train

Train set for Telugu

mr-tydi/th

Complete Thai dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.th.documents

Complete Thai dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.th.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Complete Thai dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.th.qrels

Complete Thai dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.th

Complete Thai dataset, including all train, dev, and test queries and qrels.

Dataset irds.mr-tydi.th.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development set for Thai

Dataset irds.mr-tydi.th.dev.qrels

Development set for Thai

Dataset irds.mr-tydi.th.dev

Development set for Thai

Dataset irds.mr-tydi.th.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set for Thai

Dataset irds.mr-tydi.th.test.qrels

Test set for Thai

Dataset irds.mr-tydi.th.test

Test set for Thai

Dataset irds.mr-tydi.th.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train set for Thai

Dataset irds.mr-tydi.th.train.qrels

Train set for Thai

Dataset irds.mr-tydi.th.train

Train set for Thai

MSMARCO (document)

"Based the questions in the [MS-MARCO] Question Answering Dataset and the documents which answered the questions a document ranking task was formulated. There are 3.2 million documents and the goal is to rank based on their relevance. Relevance labels are derived from what passages was marked as having the answer in the QnA dataset."

See also: msmarco-passage
Documents: Text extracted from web pages
Queries: Natural language questions (from query log)
Leaderboard
Dataset Paper

Dataset irds.msmarco-document.documents

"Based the questions in the [MS-MARCO] Question Answering Dataset and the documents which answered the questions a document ranking task was formulated. There are 3.2 million documents and the goal is to rank based on their relevance. Relevance labels are derived from what passages was marked as having the answer in the QnA dataset."

See also: msmarco-passage
Documents: Text extracted from web pages
Queries: Natural language questions (from query log)
Leaderboard
Dataset Paper

Dataset irds.msmarco-document.dev.queries

Official dev set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.

Dataset irds.msmarco-document.dev.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official dev set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.

Dataset irds.msmarco-document.dev.qrels

Official dev set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.

Dataset irds.msmarco-document.dev

Official dev set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.
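
As a minimal sketch of the re-ranking setting mentioned above, the following re-orders a first-stage candidate list and scores it with reciprocal rank, a measure that suits queries with exactly one positive judgment. The toy document IDs, the model scores, and the helpers rerank and reciprocal_rank are hypothetical; they are not part of ir_datasets or datamaestro.

```python
from typing import Callable, Dict, List


def rerank(candidates: List[str], score: Callable[[str], float]) -> List[str]:
    """Re-order first-stage candidates by a new score, highest first."""
    return sorted(candidates, key=score, reverse=True)


def reciprocal_rank(ranking: List[str], positive: str) -> float:
    """Return 1/rank of the single positive document, or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id == positive:
            return 1.0 / rank
    return 0.0


# Toy stand-in for the scoreddocs of one query, in first-stage order.
candidates = ["D3", "D1", "D2"]
# Hypothetical scores produced by a re-ranking model.
model_scores: Dict[str, float] = {"D1": 0.9, "D2": 0.4, "D3": 0.1}

reranked = rerank(candidates, lambda doc_id: model_scores[doc_id])
rr_before = reciprocal_rank(candidates, "D1")
rr_after = reciprocal_rank(reranked, "D1")
```

In a real run the candidate list would come from the scoreddocs entry and the positive document from the qrels entry of the same split.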

Dataset irds.msmarco-document.eval.queries

Official eval set for submission to MS MARCO leaderboard. Relevance judgments are hidden.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.

Dataset irds.msmarco-document.eval.scoreddocs

Official eval set for submission to MS MARCO leaderboard. Relevance judgments are hidden.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.

Dataset irds.msmarco-document.orcas.queries

"ORCAS is a click-based dataset associated with the TREC Deep Learning Track. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries."

Queries: From query log
Relevance Data: User clicks
Scored docs: Indri Query Likelihood model
Dataset Paper

Dataset irds.msmarco-document.orcas.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

"ORCAS is a click-based dataset associated with the TREC Deep Learning Track. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries."

Queries: From query log
Relevance Data: User clicks
Scored docs: Indri Query Likelihood model
Dataset Paper

Dataset irds.msmarco-document.orcas.qrels

"ORCAS is a click-based dataset associated with the TREC Deep Learning Track. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries."

Queries: From query log
Relevance Data: User clicks
Scored docs: Indri Query Likelihood model
Dataset Paper

Dataset irds.msmarco-document.orcas

"ORCAS is a click-based dataset associated with the TREC Deep Learning Track. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries."

Queries: From query log
Relevance Data: User clicks
Scored docs: Indri Query Likelihood model
Dataset Paper

Dataset irds.msmarco-document.train.queries

Official train set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.

Dataset irds.msmarco-document.train.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official train set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.

Dataset irds.msmarco-document.train.qrels

Official train set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.

Dataset irds.msmarco-document.train

Official train set. All queries have exactly 1 (positive) relevance judgment.

scoreddocs are the top 100 results from Indri QL. These are used for the "re-ranking" setting.

Dataset irds.msmarco-document.trec-dl-2019.queries

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document/trec-dl-2019/judged).

Shared Task Paper

Dataset irds.msmarco-document.trec-dl-2019.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document/trec-dl-2019/judged).

Shared Task Paper

Dataset irds.msmarco-document.trec-dl-2019.qrels

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document/trec-dl-2019/judged).

Shared Task Paper

Dataset irds.msmarco-document.trec-dl-2019

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document/trec-dl-2019/judged).

Shared Task Paper

Dataset irds.msmarco-document.trec-dl-2019.judged.queries

Subset of msmarco-document/trec-dl-2019, only including queries with qrels.

Dataset irds.msmarco-document.trec-dl-2019.judged.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of msmarco-document/trec-dl-2019, only including queries with qrels.

Dataset irds.msmarco-document.trec-dl-2019.judged.qrels

Subset of msmarco-document/trec-dl-2019, only including queries with qrels.

Dataset irds.msmarco-document.trec-dl-2019.judged

Subset of msmarco-document/trec-dl-2019, only including queries with qrels.
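
The "judged" variants keep only the queries that have at least one qrel. A minimal sketch of that filtering on toy data follows; the Query and Qrel tuples and the judged_queries helper are hypothetical illustrations, not the library's own types.

```python
from typing import Iterable, List, NamedTuple, Set


class Query(NamedTuple):
    query_id: str
    text: str


class Qrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int


def judged_queries(queries: Iterable[Query], qrels: Iterable[Qrel]) -> List[Query]:
    """Keep only the queries that appear in at least one relevance judgment."""
    judged_ids: Set[str] = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Toy data: three queries, of which only q1 and q3 were assessed.
queries = [Query("q1", "first"), Query("q2", "second"), Query("q3", "third")]
qrels = [Qrel("q1", "D1", 2), Qrel("q3", "D7", 0), Qrel("q3", "D9", 1)]

judged = judged_queries(queries, qrels)
```

Note that a query counts as judged even when all of its labels are 0, since it still has qrels.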

Dataset irds.msmarco-document.trec-dl-2020.queries

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document/trec-dl-2020/judged).

Shared Task Paper

Dataset irds.msmarco-document.trec-dl-2020.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document/trec-dl-2020/judged).

Shared Task Paper

Dataset irds.msmarco-document.trec-dl-2020.qrels

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document/trec-dl-2020/judged).

Shared Task Paper

Dataset irds.msmarco-document.trec-dl-2020

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document/trec-dl-2020/judged).

Shared Task Paper

Dataset irds.msmarco-document.trec-dl-2020.judged.queries

Subset of msmarco-document/trec-dl-2020, only including queries with qrels.

Dataset irds.msmarco-document.trec-dl-2020.judged.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of msmarco-document/trec-dl-2020, only including queries with qrels.

Dataset irds.msmarco-document.trec-dl-2020.judged.qrels

Subset of msmarco-document/trec-dl-2020, only including queries with qrels.

Dataset irds.msmarco-document.trec-dl-2020.judged

Subset of msmarco-document/trec-dl-2020, only including queries with qrels.

Dataset irds.msmarco-document.trec-dl-hard.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A more challenging subset of msmarco-document/trec-dl-2019 and msmarco-document/trec-dl-2020.

Data website
See also: msmarco-passage/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.qrels

A more challenging subset of msmarco-document/trec-dl-2019 and msmarco-document/trec-dl-2020.

Data website
See also: msmarco-passage/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard

A more challenging subset of msmarco-document/trec-dl-2019 and msmarco-document/trec-dl-2020.

Data website
See also: msmarco-passage/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold1.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 1 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold1.qrels

Fold 1 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold1

Fold 1 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold2.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 2 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold2.qrels

Fold 2 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold2

Fold 2 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold3.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 3 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold3.qrels

Fold 3 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold3

Fold 3 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold4.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 4 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold4.qrels

Fold 4 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold4

Fold 4 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold5.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Fold 5 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold5.qrels

Fold 5 of msmarco-document/trec-dl-hard

Dataset irds.msmarco-document.trec-dl-hard.fold5

Fold 5 of msmarco-document/trec-dl-hard
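
This page does not say how the five folds are meant to be used. Assuming standard cross-validation, where each fold is held out once and the remaining four are used for tuning, the splits could be enumerated as in this sketch. The dataset IDs are the ones listed above; the cross_validation_splits helper is hypothetical.

```python
from typing import Iterator, List, Tuple

# Dataset IDs for the five folds, as listed in this section.
FOLDS: List[str] = [
    f"irds.msmarco-document.trec-dl-hard.fold{i}" for i in range(1, 6)
]


def cross_validation_splits(folds: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training folds, held-out fold) pairs, one per fold."""
    for held_out in folds:
        training = [fold for fold in folds if fold != held_out]
        yield training, held_out


splits = list(cross_validation_splits(FOLDS))
```

Each ID in a split could then be passed to whatever loader the surrounding pipeline uses.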

Anchor Text for Version 1 of MS MARCO

For version 1 of MS MARCO, the anchor text collection enriches 1,703,834 documents with anchor text extracted from six Common Crawl snapshots. To keep the collection size reasonable, we sampled 1,000 anchor texts for documents with more than 1,000 anchor texts (with this sampling, all anchor text is retained for 94% of the documents). The text field contains the concatenated anchor texts, and the anchors field contains the anchor texts as a list. The raw dataset with additional information (roughly 100GB) is available online.

Dataset irds.msmarco-document.anchor-text.documents

For version 1 of MS MARCO, the anchor text collection enriches 1,703,834 documents with anchor text extracted from six Common Crawl snapshots. To keep the collection size reasonable, we sampled 1,000 anchor texts for documents with more than 1,000 anchor texts (with this sampling, all anchor text is retained for 94% of the documents). The text field contains the concatenated anchor texts, and the anchors field contains the anchor texts as a list. The raw dataset with additional information (roughly 100GB) is available online.
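
The capping and the two fields described above can be sketched as follows. This is an illustration under assumptions, not the collection's actual build code: uniform sampling, the fixed seed, the single-space separator, and the build_anchor_record helper are all hypothetical choices, since the description only says that anchor texts were sampled and concatenated.

```python
import random
from typing import Dict, List, Union

MAX_ANCHORS = 1000  # cap stated in the dataset description


def build_anchor_record(
    doc_id: str, anchors: List[str], seed: int = 0
) -> Dict[str, Union[str, List[str]]]:
    """Cap the anchor texts at MAX_ANCHORS and derive the text field."""
    if len(anchors) > MAX_ANCHORS:
        # Assumption: uniform sampling without replacement.
        anchors = random.Random(seed).sample(anchors, MAX_ANCHORS)
    return {
        "doc_id": doc_id,
        "text": " ".join(anchors),  # assumption: joined with single spaces
        "anchors": list(anchors),
    }


# A document below the cap keeps every anchor text.
small = build_anchor_record("D1", ["home", "about us"])
# A document above the cap is reduced to 1,000 anchor texts.
large = build_anchor_record("D2", [f"a{i}" for i in range(1500)])
```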

MSMARCO (document, version 2)

Version 2 of the MS MARCO document ranking dataset. The corpus contains 12M documents (roughly 3x as many as version 1).

Version 1 of dataset: msmarco-document
Documents: Text extracted from web pages
Queries: Natural language questions (from query log)
Dataset Paper

Dataset irds.msmarco-document-v2.documents

Version 2 of the MS MARCO document ranking dataset. The corpus contains 12M documents (roughly 3x as many as version 1).

Version 1 of dataset: msmarco-document
Documents: Text extracted from web pages
Queries: Natural language questions (from query log)
Dataset Paper

Dataset irds.msmarco-document-v2.dev1.queries

Official dev1 set with 4,552 queries.

Dataset irds.msmarco-document-v2.dev1.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official dev1 set with 4,552 queries.

Dataset irds.msmarco-document-v2.dev1.qrels

Official dev1 set with 4,552 queries.

Dataset irds.msmarco-document-v2.dev1

Official dev1 set with 4,552 queries.

Dataset irds.msmarco-document-v2.dev2.queries

Official dev2 set with 5,000 queries.

Dataset irds.msmarco-document-v2.dev2.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official dev2 set with 5,000 queries.

Dataset irds.msmarco-document-v2.dev2.qrels

Official dev2 set with 5,000 queries.

Dataset irds.msmarco-document-v2.dev2

Official dev2 set with 5,000 queries.

Dataset irds.msmarco-document-v2.train.queries

Official train set with 322,196 queries.

Dataset irds.msmarco-document-v2.train.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official train set with 322,196 queries.

Dataset irds.msmarco-document-v2.train.qrels

Official train set with 322,196 queries.

Dataset irds.msmarco-document-v2.train

Official train set with 322,196 queries.

Dataset irds.msmarco-document-v2.trec-dl-2019.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document-v2/trec-dl-2019/judged).

Shared Task Paper

Dataset irds.msmarco-document-v2.trec-dl-2019.qrels

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document-v2/trec-dl-2019/judged).

Shared Task Paper

Dataset irds.msmarco-document-v2.trec-dl-2019

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document-v2/trec-dl-2019/judged).

Shared Task Paper

Dataset irds.msmarco-document-v2.trec-dl-2019.judged.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of msmarco-document-v2/trec-dl-2019, only including queries with qrels.

Dataset irds.msmarco-document-v2.trec-dl-2019.judged.qrels

Subset of msmarco-document-v2/trec-dl-2019, only including queries with qrels.

Dataset irds.msmarco-document-v2.trec-dl-2019.judged

Subset of msmarco-document-v2/trec-dl-2019, only including queries with qrels.

Dataset irds.msmarco-document-v2.trec-dl-2020.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document-v2/trec-dl-2020/judged).

Shared Task Paper

Dataset irds.msmarco-document-v2.trec-dl-2020.qrels

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document-v2/trec-dl-2020/judged).

Shared Task Paper

Dataset irds.msmarco-document-v2.trec-dl-2020

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-document/eval. A subset of these queries was judged by NIST assessors (filtered list available in msmarco-document-v2/trec-dl-2020/judged).

Shared Task Paper

Dataset irds.msmarco-document-v2.trec-dl-2020.judged.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of msmarco-document-v2/trec-dl-2020, only including queries with qrels.

Dataset irds.msmarco-document-v2.trec-dl-2020.judged.qrels

Subset of msmarco-document-v2/trec-dl-2020, only including queries with qrels.

Dataset irds.msmarco-document-v2.trec-dl-2020.judged

Subset of msmarco-document-v2/trec-dl-2020, only including queries with qrels.

Dataset irds.msmarco-document-v2.trec-dl-2021.queries

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Note that at this time, qrels are only available to those with TREC active participant login credentials.

Dataset irds.msmarco-document-v2.trec-dl-2021.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Note that at this time, qrels are only available to those with TREC active participant login credentials.

Dataset irds.msmarco-document-v2.trec-dl-2021.qrels

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Note that at this time, qrels are only available to those with TREC active participant login credentials.

Dataset irds.msmarco-document-v2.trec-dl-2021

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Note that at this time, qrels are only available to those with TREC active participant login credentials.

Dataset irds.msmarco-document-v2.trec-dl-2021.judged.queries

msmarco-document-v2/trec-dl-2021, but filtered down to the 57 queries with qrels.

Note that at this time, this is only available to those with TREC active participant login credentials.

Dataset irds.msmarco-document-v2.trec-dl-2021.judged.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

msmarco-document-v2/trec-dl-2021, but filtered down to the 57 queries with qrels.

Note that at this time, this is only available to those with TREC active participant login credentials.

Dataset irds.msmarco-document-v2.trec-dl-2021.judged.qrels

msmarco-document-v2/trec-dl-2021, but filtered down to the 57 queries with qrels.

Note that at this time, this is only available to those with TREC active participant login credentials.

Dataset irds.msmarco-document-v2.trec-dl-2021.judged

msmarco-document-v2/trec-dl-2021, but filtered down to the 57 queries with qrels.

Note that at this time, this is only available to those with TREC active participant login credentials.

Dataset irds.msmarco-document-v2.trec-dl-2022.queries

Official topics for the TREC Deep Learning (DL) 2022 shared task.

Note that these qrels are inferred from the passage ranking task; a document's relevance label is the maximum of the labels of its passages.

Dataset irds.msmarco-document-v2.trec-dl-2022.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official topics for the TREC Deep Learning (DL) 2022 shared task.

Note that these qrels are inferred from the passage ranking task; a document's relevance label is the maximum of the labels of its passages.

Dataset irds.msmarco-document-v2.trec-dl-2022.qrels

Official topics for the TREC Deep Learning (DL) 2022 shared task.

Note that these qrels are inferred from the passage ranking task; a document's relevance label is the maximum of the labels of its passages.

Dataset irds.msmarco-document-v2.trec-dl-2022

Official topics for the TREC Deep Learning (DL) 2022 shared task.

Note that these qrels are inferred from the passage ranking task; a document's relevance label is the maximum of the labels of its passages.
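
The rule stated above, where a document's label is the maximum label over its passages, can be sketched on toy data as follows. The IDs, the passage_to_doc mapping, and the infer_document_qrels helper are hypothetical; how passages are actually linked to documents in the released files is not shown here.

```python
from typing import Dict, Tuple

QrelKey = Tuple[str, str]  # (query_id, passage_id) or (query_id, doc_id)


def infer_document_qrels(
    passage_qrels: Dict[QrelKey, int], passage_to_doc: Dict[str, str]
) -> Dict[QrelKey, int]:
    """Give each (query, document) pair the maximum label of its passages."""
    doc_qrels: Dict[QrelKey, int] = {}
    for (query_id, passage_id), label in passage_qrels.items():
        key = (query_id, passage_to_doc[passage_id])
        # Keep the highest label seen so far for this query-document pair.
        doc_qrels[key] = max(label, doc_qrels.get(key, label))
    return doc_qrels


# Toy data: passages p1 and p2 belong to document D1, p3 to document D2.
passage_to_doc = {"p1": "D1", "p2": "D1", "p3": "D2"}
passage_qrels = {("q1", "p1"): 1, ("q1", "p2"): 3, ("q1", "p3"): 0}

doc_qrels = infer_document_qrels(passage_qrels, passage_to_doc)
```

One consequence of this rule is that only documents with at least one judged passage receive a label.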

Dataset irds.msmarco-document-v2.trec-dl-2022.judged.queries

msmarco-document-v2/trec-dl-2022, but filtered down to only the queries with qrels.

Dataset irds.msmarco-document-v2.trec-dl-2022.judged.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

msmarco-document-v2/trec-dl-2022, but filtered down to only the queries with qrels.

Dataset irds.msmarco-document-v2.trec-dl-2022.judged.qrels

msmarco-document-v2/trec-dl-2022, but filtered down to only the queries with qrels.

Dataset irds.msmarco-document-v2.trec-dl-2022.judged

msmarco-document-v2/trec-dl-2022, but filtered down to only the queries with qrels.

Dataset irds.msmarco-document-v2.trec-dl-2023.queries

Official topics for the TREC Deep Learning (DL) 2023 shared task.

Dataset irds.msmarco-document-v2.trec-dl-2023.scoreddocs

Official topics for the TREC Deep Learning (DL) 2023 shared task.

Anchor Text for Version 2 of MS MARCO

For version 2 of MS MARCO, the anchor text collection enriches 4,821,244 documents with anchor text extracted from six Common Crawl snapshots. To keep the collection size reasonable, we sampled 1,000 anchor texts for documents with more than 1,000 anchor texts (with this sampling, all anchor text is retained for 97% of the documents). The text field contains the concatenated anchor texts, and the anchors field contains the anchor texts as a list. The raw dataset with additional information (roughly 100GB) is available online.

Dataset irds.msmarco-document-v2.anchor-text.documents

For version 2 of MS MARCO, the anchor text collection enriches 4,821,244 documents with anchor text extracted from six Common Crawl snapshots. To keep the collection size reasonable, we sampled 1,000 anchor texts for documents with more than 1,000 anchor texts (with this sampling, all anchor text is retained for 97% of the documents). The text field contains the concatenated anchor texts, and the anchors field contains the anchor texts as a list. The raw dataset with additional information (roughly 100GB) is available online.

MSMARCO (passage, version 2)

Version 2 of the MS MARCO passage ranking dataset. The corpus contains 138M passages, which can be linked up with documents in msmarco-document-v2.

Version 1 of dataset: msmarco-passage
Documents: Text extracted from web pages
Queries: Natural language questions (from query log)
Dataset Paper

Change Log

On July 21, 2021, the task organizers updated the train, dev1, and dev2 qrels to remove duplicate entries from the files. This should not have changed results from evaluation tools, but may result in non-repeatable results if these files were used in another process (e.g., model training). The original qrels file for msmarco-passage-v2/train can be found here to aid in result repeatability.

Dataset irds.msmarco-passage-v2.documents

Version 2 of the MS MARCO passage ranking dataset. The corpus contains 138M passages, which can be linked up with documents in msmarco-document-v2.

Version 1 of dataset: msmarco-passage
Documents: Text extracted from web pages
Queries: Natural language questions (from query log)
Dataset Paper

Change Log

On July 21, 2021, the task organizers updated the train, dev1, and dev2 qrels to remove duplicate entries from the files. This should not have changed results from evaluation tools, but may result in non-repeatable results if these files were used in another process (e.g., model training). The original qrels file for msmarco-passage-v2/train can be found here to aid in result repeatability.
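
To illustrate why removing duplicate entries leaves the set of judgments unchanged while altering the file contents, here is a minimal sketch. It assumes whitespace-separated lines in the common TREC qrels layout (query ID, iteration, document ID, relevance); the toy lines and the deduplicate_qrels helper are hypothetical, and the organizers' actual procedure is not described on this page.

```python
from typing import Iterable, List, Set, Tuple


def deduplicate_qrels(lines: Iterable[str]) -> List[str]:
    """Drop repeated qrels lines, keeping the first occurrence of each."""
    seen: Set[Tuple[str, ...]] = set()
    unique: List[str] = []
    for line in lines:
        key = tuple(line.split())
        if not key:  # skip blank lines
            continue
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique


# Toy qrels with one duplicated entry.
original = ["q1 0 p1 1", "q2 0 p5 1", "q1 0 p1 1"]
cleaned = deduplicate_qrels(original)

# The set of judgments is identical, but the number of lines differs,
# which matters to any process that iterates over the file line by line.
same_judgments = set(original) == set(cleaned)
```

A training loop that samples positives per line would see q1's passage twice in the original file and once in the cleaned one, which is the kind of non-repeatability the change log warns about.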

Dataset irds.msmarco-passage-v2.dev1.queries

Official dev1 set with 3,903 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.

Dataset irds.msmarco-passage-v2.dev1.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official dev1 set with 3,903 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.

Dataset irds.msmarco-passage-v2.dev1.qrels

Official dev1 set with 3,903 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.

Dataset irds.msmarco-passage-v2.dev1

Official dev1 set with 3,903 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.

Dataset irds.msmarco-passage-v2.dev2.queries

Official dev2 set with 4,281 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.

Dataset irds.msmarco-passage-v2.dev2.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official dev2 set with 4,281 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.

Dataset irds.msmarco-passage-v2.dev2.qrels

Official dev2 set with 4,281 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.

Dataset irds.msmarco-passage-v2.dev2

Official dev2 set with 4,281 queries.

Note that the qrels in this dataset are not directly human-assessed; labels from msmarco-passage are mapped to documents via URL, these documents are re-passaged, and then the best approximate match is identified.

Dataset irds.msmarco-passage-v2.train.queries

Official train set with 277,144 queries.

Dataset irds.msmarco-passage-v2.train.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official train set with 277,144 queries.

Dataset irds.msmarco-passage-v2.train.qrels

Official train set with 277,144 queries.

Dataset irds.msmarco-passage-v2.train

Official train set with 277,144 queries.

Dataset irds.msmarco-passage-v2.trec-dl-2021.queries

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Dataset irds.msmarco-passage-v2.trec-dl-2021.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Dataset irds.msmarco-passage-v2.trec-dl-2021.qrels

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Dataset irds.msmarco-passage-v2.trec-dl-2021

Official topics for the TREC Deep Learning (DL) 2021 shared task.

Dataset irds.msmarco-passage-v2.trec-dl-2021.judged.queries

msmarco-passage-v2/trec-dl-2021, but filtered down to the 53 queries with qrels.
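The "judged" filtering described above can be sketched in a few lines. This is an illustration under assumptions, not the library's implementation: queries are modelled as `(query_id, text)` pairs and qrels as `(query_id, doc_id, relevance)` triples, and the function name `judged_queries` is hypothetical.

```python
def judged_queries(queries, qrels):
    """Keep only the queries that have at least one relevance judgment.

    `queries` is an iterable of (query_id, text) pairs and `qrels` an iterable
    of (query_id, doc_id, relevance) triples; both shapes are assumptions.
    """
    judged_ids = {query_id for query_id, _doc_id, _relevance in qrels}
    return [(query_id, text) for query_id, text in queries if query_id in judged_ids]
```

Applied to the full topic set and the official qrels, a filter of this kind leaves only the topics that were actually assessed, so unjudged queries do not count as zero-scoring topics in averaged metrics.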

Dataset irds.msmarco-passage-v2.trec-dl-2021.judged.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

msmarco-passage-v2/trec-dl-2021, but filtered down to the 53 queries with qrels.

Dataset irds.msmarco-passage-v2.trec-dl-2021.judged.qrels

msmarco-passage-v2/trec-dl-2021, but filtered down to the 53 queries with qrels.

Dataset irds.msmarco-passage-v2.trec-dl-2021.judged

msmarco-passage-v2/trec-dl-2021, but filtered down to the 53 queries with qrels.

Dataset irds.msmarco-passage-v2.trec-dl-2022.queries

Official topics for the TREC Deep Learning (DL) 2022 shared task.

Note that the officially-released qrels include relevance labels propagated to duplicate passages, while results presented in the notebook papers remove duplicate documents. This means that the results are not directly comparable, and extra care should be taken when making comparisons among systems to ensure that they were evaluated in the same settings.

Dataset irds.msmarco-passage-v2.trec-dl-2022.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official topics for the TREC Deep Learning (DL) 2022 shared task.

Note that the officially-released qrels include relevance labels propagated to duplicate passages, while results presented in the notebook papers remove duplicate documents. This means that the results are not directly comparable, and extra care should be taken when making comparisons among systems to ensure that they were evaluated in the same settings.

Dataset irds.msmarco-passage-v2.trec-dl-2022.qrels

Official topics for the TREC Deep Learning (DL) 2022 shared task.

Note that the officially-released qrels include relevance labels propagated to duplicate passages, while results presented in the notebook papers remove duplicate documents. This means that the results are not directly comparable, and extra care should be taken when making comparisons among systems to ensure that they were evaluated in the same settings.

Dataset irds.msmarco-passage-v2.trec-dl-2022

Official topics for the TREC Deep Learning (DL) 2022 shared task.

Note that the officially-released qrels include relevance labels propagated to duplicate passages, while results presented in the notebook papers remove duplicate documents. This means that the results are not directly comparable, and extra care should be taken when making comparisons among systems to ensure that they were evaluated in the same settings.

Dataset irds.msmarco-passage-v2.trec-dl-2022.judged.queries

msmarco-passage-v2/trec-dl-2022, but filtered down to only the queries with qrels.

Dataset irds.msmarco-passage-v2.trec-dl-2022.judged.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

msmarco-passage-v2/trec-dl-2022, but filtered down to only the queries with qrels.

Dataset irds.msmarco-passage-v2.trec-dl-2022.judged.qrels

msmarco-passage-v2/trec-dl-2022, but filtered down to only the queries with qrels.

Dataset irds.msmarco-passage-v2.trec-dl-2022.judged

msmarco-passage-v2/trec-dl-2022, but filtered down to only the queries with qrels.

Dataset irds.msmarco-passage-v2.trec-dl-2023.queries

Official topics for the TREC Deep Learning (DL) 2023 shared task.

Dataset irds.msmarco-passage-v2.trec-dl-2023.scoreddocs

Official topics for the TREC Deep Learning (DL) 2023 shared task.

msmarco-passage-v2/dedup

Dataset irds.msmarco-passage-v2.dedup.documents: → datamaestro_text.datasets.irds.data.Documents

MSMARCO (QnA)

The MS MARCO Question Answering dataset. This is the source collection of msmarco-passage and msmarco-document.

It is prohibited to use information from this dataset for submissions to the MS MARCO passage and document leaderboards or the TREC DL shared task.

Query IDs in this collection align with those found in msmarco-passage and msmarco-document. The collection does not provide doc_ids, so these are assigned in the following format: [msmarco_passage_id]-[url_seq], where [msmarco_passage_id] is the document from msmarco-passage that has matching contents and [url_seq] is assigned sequentially for each URL encountered. In other words, all documents with the same prefix have the same text; they only differ in the originating document.

Doc msmarco_passage_id fields are assigned by matching passage contents in msmarco-passage, and this field is provided for every document. Doc msmarco_document_id fields are assigned by matching the URL to the one found in msmarco-document. Due to how msmarco-document was constructed, there is not necessarily a match (value will be None if no match).
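The doc_id format described above, [msmarco_passage_id]-[url_seq], can be taken apart with a small helper. This is a minimal sketch: the function names `split_doc_id` and `group_by_passage` are hypothetical, and the identifiers in the usage note are made up for illustration. The sequence part is kept as a string because the description does not state its exact type.

```python
from collections import defaultdict


def split_doc_id(doc_id):
    """Split '[msmarco_passage_id]-[url_seq]' into its two parts.

    The split happens at the last hyphen; both parts are returned as strings.
    """
    passage_id, sep, url_seq = doc_id.rpartition("-")
    if not sep or not passage_id or not url_seq:
        raise ValueError("unexpected doc_id format: %r" % doc_id)
    return passage_id, url_seq


def group_by_passage(doc_ids):
    """Group doc_ids by their msmarco_passage_id prefix.

    Per the description, documents in one group have the same text and differ
    only in the originating document.
    """
    groups = defaultdict(list)
    for doc_id in doc_ids:
        passage_id, _url_seq = split_doc_id(doc_id)
        groups[passage_id].append(doc_id)
    return dict(groups)
```

For instance, `group_by_passage(["123-0", "123-1", "77-0"])` places the first two identifiers in one group, which can help avoid counting the same passage text more than once.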

Documents: Short passages (from web)
Queries: Natural language questions (from query log), including type and natural-language answers.
Leaderboard
Dataset Paper
More information

Dataset irds.msmarco-qna.documents

The MS MARCO Question Answering dataset. This is the source collection of msmarco-passage and msmarco-document.

It is prohibited to use information from this dataset for submissions to the MS MARCO passage and document leaderboards or the TREC DL shared task.

Query IDs in this collection align with those found in msmarco-passage and msmarco-document. The collection does not provide doc_ids, so these are assigned in the following format: [msmarco_passage_id]-[url_seq], where [msmarco_passage_id] is the document from msmarco-passage that has matching contents and [url_seq] is assigned sequentially for each URL encountered. In other words, all documents with the same prefix have the same text; they only differ in the originating document.

Doc msmarco_passage_id fields are assigned by matching passage contents in msmarco-passage, and this field is provided for every document. Doc msmarco_document_id fields are assigned by matching the URL to the one found in msmarco-document. Due to how msmarco-document was constructed, there is not necessarily a match (value will be None if no match).

Documents: Short passages (from web)
Queries: Natural language questions (from query log), including type and natural-language answers.
Leaderboard
Dataset Paper
More information

Dataset irds.msmarco-qna.dev.queries

Official dev set.

The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.

Dataset irds.msmarco-qna.dev.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official dev set.

The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.

Dataset irds.msmarco-qna.dev.qrels

Official dev set.

The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.

Dataset irds.msmarco-qna.dev

Official dev set.

The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.

Dataset irds.msmarco-qna.eval.queries

Official eval set.

The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.

Dataset irds.msmarco-qna.eval.scoreddocs

Official eval set.

The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.

Dataset irds.msmarco-qna.train.queries

Official train set.

The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.

Dataset irds.msmarco-qna.train.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official train set.

The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.

Dataset irds.msmarco-qna.train.qrels

Official train set.

The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.

Dataset irds.msmarco-qna.train

Official train set.

The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.

nano-beir/arguana

A version of the ArguAna Counterargs dataset, for argument retrieval.

Dataset irds.nano-beir.arguana.documents

A version of the ArguAna Counterargs dataset, for argument retrieval.

Dataset irds.nano-beir.arguana.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the ArguAna Counterargs dataset, for argument retrieval.

Dataset irds.nano-beir.arguana.qrels

A version of the ArguAna Counterargs dataset, for argument retrieval.

Dataset irds.nano-beir.arguana

A version of the ArguAna Counterargs dataset, for argument retrieval.

nano-beir/climate-fever

A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.

Dataset irds.nano-beir.climate-fever.documents

A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.

Dataset irds.nano-beir.climate-fever.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.

Dataset irds.nano-beir.climate-fever.qrels

A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.

Dataset irds.nano-beir.climate-fever

A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.

nano-beir/dbpedia-entity

A version of the DBPedia-Entity-v2 dataset for entity retrieval.

Dataset irds.nano-beir.dbpedia-entity.documents

A version of the DBPedia-Entity-v2 dataset for entity retrieval.

Dataset irds.nano-beir.dbpedia-entity.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the DBPedia-Entity-v2 dataset for entity retrieval.

Dataset irds.nano-beir.dbpedia-entity.qrels

A version of the DBPedia-Entity-v2 dataset for entity retrieval.

Dataset irds.nano-beir.dbpedia-entity

A version of the DBPedia-Entity-v2 dataset for entity retrieval.

nano-beir/fever

A version of the FEVER dataset for fact verification.

Dataset irds.nano-beir.fever.documents

A version of the FEVER dataset for fact verification.

Dataset irds.nano-beir.fever.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the FEVER dataset for fact verification.

Dataset irds.nano-beir.fever.qrels

A version of the FEVER dataset for fact verification.

Dataset irds.nano-beir.fever

A version of the FEVER dataset for fact verification.

nano-beir/fiqa

A version of the FIQA-2018 dataset (financial opinion question answering).

Dataset irds.nano-beir.fiqa.documents

A version of the FIQA-2018 dataset (financial opinion question answering).

Dataset irds.nano-beir.fiqa.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the FIQA-2018 dataset (financial opinion question answering).

Dataset irds.nano-beir.fiqa.qrels

A version of the FIQA-2018 dataset (financial opinion question answering).

Dataset irds.nano-beir.fiqa

A version of the FIQA-2018 dataset (financial opinion question answering).

nano-beir/hotpotqa

A version of the Hotpot QA dataset for multi-hop question answering.

Dataset irds.nano-beir.hotpotqa.documents

A version of the Hotpot QA dataset for multi-hop question answering.

Dataset irds.nano-beir.hotpotqa.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the Hotpot QA dataset for multi-hop question answering.

Dataset irds.nano-beir.hotpotqa.qrels

A version of the Hotpot QA dataset for multi-hop question answering.

Dataset irds.nano-beir.hotpotqa

A version of the Hotpot QA dataset for multi-hop question answering.

nano-beir/msmarco

A version of the MS MARCO passage ranking dataset.

Note that this version differs from msmarco-passage, in that it does not correct the encoding problems in the source documents.

Dataset irds.nano-beir.msmarco.documents

A version of the MS MARCO passage ranking dataset.

Note that this version differs from msmarco-passage, in that it does not correct the encoding problems in the source documents.

Dataset irds.nano-beir.msmarco.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the MS MARCO passage ranking dataset.

Note that this version differs from msmarco-passage, in that it does not correct the encoding problems in the source documents.

Dataset irds.nano-beir.msmarco.qrels

A version of the MS MARCO passage ranking dataset.

Note that this version differs from msmarco-passage, in that it does not correct the encoding problems in the source documents.

Dataset irds.nano-beir.msmarco

A version of the MS MARCO passage ranking dataset.

Note that this version differs from msmarco-passage, in that it does not correct the encoding problems in the source documents.

nano-beir/nfcorpus

A version of the NF Corpus (Nutrition Facts).

Data pre-processing may be different than what is done in nfcorpus.

Dataset irds.nano-beir.nfcorpus.documents

A version of the NF Corpus (Nutrition Facts).

Data pre-processing may be different than what is done in nfcorpus.

Dataset irds.nano-beir.nfcorpus.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the NF Corpus (Nutrition Facts).

Data pre-processing may be different than what is done in nfcorpus.

Dataset irds.nano-beir.nfcorpus.qrels

A version of the NF Corpus (Nutrition Facts).

Data pre-processing may be different than what is done in nfcorpus.

Dataset irds.nano-beir.nfcorpus

A version of the NF Corpus (Nutrition Facts).

Data pre-processing may be different than what is done in nfcorpus.

nano-beir/nq

A version of the Natural Questions dev dataset.

Data pre-processing differs from what is done in both natural-questions and dpr-w100/natural-questions, especially with respect to the document collection and filtering conducted on the queries. See the BEIR paper for details.

Dataset website
Dataset paper
See also: natural-questions, dpr-w100/natural-questions

Dataset irds.nano-beir.nq.documents

A version of the Natural Questions dev dataset.

Data pre-processing differs from what is done in both natural-questions and dpr-w100/natural-questions, especially with respect to the document collection and filtering conducted on the queries. See the BEIR paper for details.

Dataset website
Dataset paper
See also: natural-questions, dpr-w100/natural-questions

Dataset irds.nano-beir.nq.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the Natural Questions dev dataset.

Data pre-processing differs from what is done in both natural-questions and dpr-w100/natural-questions, especially with respect to the document collection and filtering conducted on the queries. See the BEIR paper for details.

Dataset website
Dataset paper
See also: natural-questions, dpr-w100/natural-questions

Dataset irds.nano-beir.nq.qrels

A version of the Natural Questions dev dataset.

Data pre-processing differs from what is done in both natural-questions and dpr-w100/natural-questions, especially with respect to the document collection and filtering conducted on the queries. See the BEIR paper for details.

Dataset website
Dataset paper
See also: natural-questions, dpr-w100/natural-questions

Dataset irds.nano-beir.nq

A version of the Natural Questions dev dataset.

Data pre-processing differs from what is done in both natural-questions and dpr-w100/natural-questions, especially with respect to the document collection and filtering conducted on the queries. See the BEIR paper for details.

Dataset website
Dataset paper
See also: natural-questions, dpr-w100/natural-questions

nano-beir/quora

A version of the Quora duplicate question detection dataset (QQP).

Dataset website

Dataset irds.nano-beir.quora.documents

A version of the Quora duplicate question detection dataset (QQP).

Dataset website

Dataset irds.nano-beir.quora.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the Quora duplicate question detection dataset (QQP).

Dataset website

Dataset irds.nano-beir.quora.qrels

A version of the Quora duplicate question detection dataset (QQP).

Dataset website

Dataset irds.nano-beir.quora

A version of the Quora duplicate question detection dataset (QQP).

Dataset website

nano-beir/scidocs

A version of the SciDocs dataset, used for citation retrieval.

Dataset irds.nano-beir.scidocs.documents

A version of the SciDocs dataset, used for citation retrieval.

Dataset irds.nano-beir.scidocs.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the SciDocs dataset, used for citation retrieval.

Dataset irds.nano-beir.scidocs.qrels

A version of the SciDocs dataset, used for citation retrieval.

Dataset irds.nano-beir.scidocs

A version of the SciDocs dataset, used for citation retrieval.

nano-beir/scifact

A version of the SciFact dataset, for fact verification.

Dataset irds.nano-beir.scifact.documents

A version of the SciFact dataset, for fact verification.

Dataset irds.nano-beir.scifact.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of the SciFact dataset, for fact verification.

Dataset irds.nano-beir.scifact.qrels

A version of the SciFact dataset, for fact verification.

Dataset irds.nano-beir.scifact

A version of the SciFact dataset, for fact verification.

nano-beir/webis-touche2020

Original version of the Touché-2020 dataset, for argument retrieval.

Consider using beir/webis-touche2020/v2 instead; it uses an updated, more complete version of the qrels.

Dataset irds.nano-beir.webis-touche2020.documents

Original version of the Touché-2020 dataset, for argument retrieval.

Consider using beir/webis-touche2020/v2 instead; it uses an updated, more complete version of the qrels.

Dataset irds.nano-beir.webis-touche2020.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Original version of the Touché-2020 dataset, for argument retrieval.

Consider using beir/webis-touche2020/v2 instead; it uses an updated, more complete version of the qrels.

Dataset irds.nano-beir.webis-touche2020.qrels

Original version of the Touché-2020 dataset, for argument retrieval.

Consider using beir/webis-touche2020/v2 instead; it uses an updated, more complete version of the qrels.

Dataset irds.nano-beir.webis-touche2020

Original version of the Touché-2020 dataset, for argument retrieval.

Consider using beir/webis-touche2020/v2 instead; it uses an updated, more complete version of the qrels.

neumarco/fa

The msmarco-passage corpus, translated to Persian (Farsi).

Dataset irds.neumarco.fa.documents

The msmarco-passage corpus, translated to Persian (Farsi).

Dataset irds.neumarco.fa.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/dev, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.dev.qrels

A version of msmarco-passage/dev, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.dev

A version of msmarco-passage/dev, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.dev.judged.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/dev/judged, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.dev.judged.qrels

A version of msmarco-passage/dev/judged, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.dev.judged

A version of msmarco-passage/dev/judged, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.dev.small.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/dev/small, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.dev.small.qrels

A version of msmarco-passage/dev/small, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.dev.small

A version of msmarco-passage/dev/small, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/train, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.train.docpairs: A version of msmarco-passage/train, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.train.qrels

A version of msmarco-passage/train, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.train

A version of msmarco-passage/train, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.train.judged.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/train/judged, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.train.judged.docpairs: A version of msmarco-passage/train/judged, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.train.judged.qrels

A version of msmarco-passage/train/judged, with the corpus translated to Persian (Farsi).

Dataset irds.neumarco.fa.train.judged

A version of msmarco-passage/train/judged, with the corpus translated to Persian (Farsi).

neumarco/ru

The msmarco-passage corpus, translated to Russian.

Dataset irds.neumarco.ru.documents

The msmarco-passage corpus, translated to Russian.

Dataset irds.neumarco.ru.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/dev, with the corpus translated to Russian.

Dataset irds.neumarco.ru.dev.qrels

A version of msmarco-passage/dev, with the corpus translated to Russian.

Dataset irds.neumarco.ru.dev

A version of msmarco-passage/dev, with the corpus translated to Russian.

Dataset irds.neumarco.ru.dev.judged.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/dev/judged, with the corpus translated to Russian.

Dataset irds.neumarco.ru.dev.judged.qrels

A version of msmarco-passage/dev/judged, with the corpus translated to Russian.

Dataset irds.neumarco.ru.dev.judged

A version of msmarco-passage/dev/judged, with the corpus translated to Russian.

Dataset irds.neumarco.ru.dev.small.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/dev/small, with the corpus translated to Russian.

Dataset irds.neumarco.ru.dev.small.qrels

A version of msmarco-passage/dev/small, with the corpus translated to Russian.

Dataset irds.neumarco.ru.dev.small

A version of msmarco-passage/dev/small, with the corpus translated to Russian.

Dataset irds.neumarco.ru.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/train, with the corpus translated to Russian.

Dataset irds.neumarco.ru.train.docpairs: A version of msmarco-passage/train, with the corpus translated to Russian.

Dataset irds.neumarco.ru.train.qrels

A version of msmarco-passage/train, with the corpus translated to Russian.

Dataset irds.neumarco.ru.train

A version of msmarco-passage/train, with the corpus translated to Russian.

Dataset irds.neumarco.ru.train.judged.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/train/judged, with the corpus translated to Russian.

Dataset irds.neumarco.ru.train.judged.docpairs: A version of msmarco-passage/train/judged, with the corpus translated to Russian.

Dataset irds.neumarco.ru.train.judged.qrels

A version of msmarco-passage/train/judged, with the corpus translated to Russian.

Dataset irds.neumarco.ru.train.judged

A version of msmarco-passage/train/judged, with the corpus translated to Russian.

neumarco/zh

The msmarco-passage corpus, translated to Chinese.

Dataset irds.neumarco.zh.documents

The msmarco-passage corpus, translated to Chinese.

Dataset irds.neumarco.zh.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/dev, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.dev.qrels

A version of msmarco-passage/dev, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.dev

A version of msmarco-passage/dev, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.dev.judged.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/dev/judged, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.dev.judged.qrels

A version of msmarco-passage/dev/judged, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.dev.judged

A version of msmarco-passage/dev/judged, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.dev.small.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/dev/small, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.dev.small.qrels

A version of msmarco-passage/dev/small, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.dev.small

A version of msmarco-passage/dev/small, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/train, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.train.docpairs: A version of msmarco-passage/train, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.train.qrels

A version of msmarco-passage/train, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.train

A version of msmarco-passage/train, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.train.judged.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of msmarco-passage/train/judged, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.train.judged.docpairs: A version of msmarco-passage/train/judged, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.train.judged.qrels

A version of msmarco-passage/train/judged, with the corpus translated to Chinese.

Dataset irds.neumarco.zh.train.judged

A version of msmarco-passage/train/judged, with the corpus translated to Chinese.

NFCorpus (NutritionFacts)

"NFCorpus is a full-text English retrieval data set for Medical Information Retrieval. It contains a total of 3,244 natural language queries (written in non-technical English, harvested from the NutritionFacts.org site) with 169,756 automatically extracted relevance judgments for 9,964 medical documents (written in a complex terminology-heavy language), mostly from PubMed."

Dataset irds.nfcorpus.documents

"NFCorpus is a full-text English retrieval data set for Medical Information Retrieval. It contains a total of 3,244 natural language queries (written in non-technical English, harvested from the NutritionFacts.org site) with 169,756 automatically extracted relevance judgments for 9,964 medical documents (written in a complex terminology-heavy language), mostly from PubMed."

Dataset irds.nfcorpus.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official dev set. Queries include both a title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

Dataset irds.nfcorpus.dev.qrels

Official dev set. Queries include both a title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

Dataset irds.nfcorpus.dev

Official dev set. Queries include both a title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

Dataset irds.nfcorpus.dev.nontopic.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official dev set, filtered to exclude queries from topic pages.

Dataset irds.nfcorpus.dev.nontopic.qrels

Official dev set, filtered to exclude queries from topic pages.

Dataset irds.nfcorpus.dev.nontopic

Official dev set, filtered to exclude queries from topic pages.

Dataset irds.nfcorpus.dev.video.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official dev set, filtered to only include queries from video pages.

Dataset irds.nfcorpus.dev.video.qrels

Official dev set, filtered to only include queries from video pages.

Dataset irds.nfcorpus.dev.video

Official dev set, filtered to only include queries from video pages.

Dataset irds.nfcorpus.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official test set. Queries include both a title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

Dataset irds.nfcorpus.test.qrels

Official test set. Queries include both a title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

Dataset irds.nfcorpus.test

Official test set. Queries include both the title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

Dataset irds.nfcorpus.test.nontopic.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official test set, filtered to exclude queries from topic pages.

Dataset irds.nfcorpus.test.nontopic.qrels

Official test set, filtered to exclude queries from topic pages.

Dataset irds.nfcorpus.test.nontopic

Official test set, filtered to exclude queries from topic pages.

Dataset irds.nfcorpus.test.video.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official test set, filtered to only include queries from video pages.

Dataset irds.nfcorpus.test.video.qrels

Official test set, filtered to only include queries from video pages.

Dataset irds.nfcorpus.test.video

Official test set, filtered to only include queries from video pages.

Dataset irds.nfcorpus.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official train set. Queries include both the title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

Dataset irds.nfcorpus.train.qrels

Official train set. Queries include both the title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

Dataset irds.nfcorpus.train

Official train set. Queries include both the title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

Dataset irds.nfcorpus.train.nontopic.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official train set, filtered to exclude queries from topic pages.

Dataset irds.nfcorpus.train.nontopic.qrels

Official train set, filtered to exclude queries from topic pages.

Dataset irds.nfcorpus.train.nontopic

Official train set, filtered to exclude queries from topic pages.

Dataset irds.nfcorpus.train.video.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official train set, filtered to only include queries from video pages.

Dataset irds.nfcorpus.train.video.qrels

Official train set, filtered to only include queries from video pages.

Dataset irds.nfcorpus.train.video

Official train set, filtered to only include queries from video pages.

Natural Questions

Google Natural Questions is a Q&A dataset containing long, short, and Yes/No answers from Wikipedia. ir_datasets frames this around an ad-hoc ranking setting by building a collection of all long answer candidate passages. However, short and Yes/No annotations are also available in the qrels, as are the passages presented to the annotators (via scoreddocs).

Importantly, the document collection does not consist of all Wikipedia passages, but instead a union of the candidate passages presented to the annotators (akin to MS MARCO). dpr-w100/natural-questions/train and dpr-w100/natural-questions/dev contain a filtered set of the questions in this dataset and a full Wikipedia dump (which is a more realistic retrieval setting).
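The "union of candidate passages" idea can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout (a "candidates" list of (passage_id, text) pairs) and the function name build_collection are hypothetical and are not the ir_datasets implementation.

```python
def build_collection(annotated_questions):
    """Return the union of candidate passages across all questions, keyed by passage ID."""
    collection = {}
    for question in annotated_questions:
        for passage_id, text in question["candidates"]:
            # Keep the first text seen for an ID; repeated candidates are not duplicated.
            collection.setdefault(passage_id, text)
    return collection


# Toy data (hypothetical): two questions sharing one candidate passage.
questions = [
    {"query": "who wrote hamlet",
     "candidates": [("p1", "Hamlet is a tragedy by William Shakespeare."),
                    ("p2", "The play is set in Denmark.")]},
    {"query": "where is hamlet set",
     "candidates": [("p2", "The play is set in Denmark."),
                    ("p3", "Elsinore is a castle in Denmark.")]},
]
collection = build_collection(questions)
```

Only passages that were shown to an annotator for at least one question end up in the collection, which is why it is smaller than a full Wikipedia dump.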

Dataset irds.natural-questions.documents

Google Natural Questions is a Q&A dataset containing long, short, and Yes/No answers from Wikipedia. ir_datasets frames this around an ad-hoc ranking setting by building a collection of all long answer candidate passages. However, short and Yes/No annotations are also available in the qrels, as are the passages presented to the annotators (via scoreddocs).

Importantly, the document collection does not consist of all Wikipedia passages, but instead a union of the candidate passages presented to the annotators (akin to MS MARCO). dpr-w100/natural-questions/train and dpr-w100/natural-questions/dev contain a filtered set of the questions in this dataset and a full Wikipedia dump (which is a more realistic retrieval setting).

Dataset irds.natural-questions.dev.queries

Official dev set.

Dataset irds.natural-questions.dev.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official dev set.

Dataset irds.natural-questions.dev.qrels

Official dev set.

Dataset irds.natural-questions.dev

Official dev set.

Dataset irds.natural-questions.train.queries

Official train set.

Dataset irds.natural-questions.train.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official train set.

Dataset irds.natural-questions.train.qrels

Official train set.

Dataset irds.natural-questions.train

Official train set.

NYT

The New York Times Annotated Corpus. Consists of articles published between 1987 and 2007. It is used in TREC Core 2017 and it is also useful for transferring relevance signals in cases where training data is in short supply.

Uses data from LDC2008T19. The source collection can be downloaded from the LDC.

Dataset irds.nyt.documents

The New York Times Annotated Corpus. Consists of articles published between 1987 and 2007. It is used in TREC Core 2017 and it is also useful for transferring relevance signals in cases where training data is in short supply.

Uses data from LDC2008T19. The source collection can be downloaded from the LDC.

Dataset irds.nyt.trec-core-2017.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Common Core 2017 benchmark.

Note that this dataset only contains the 50 queries assessed by NIST.

Queries: TREC-style (keyword, description, narrative)
Relevance: Deeply-annotated
Shared Task Website
Shared Task Paper

Dataset irds.nyt.trec-core-2017.qrels

The TREC Common Core 2017 benchmark.

Note that this dataset only contains the 50 queries assessed by NIST.

Queries: TREC-style (keyword, description, narrative)
Relevance: Deeply-annotated
Shared Task Website
Shared Task Paper

Dataset irds.nyt.trec-core-2017

The TREC Common Core 2017 benchmark.

Note that this dataset only contains the 50 queries assessed by NIST.

Queries: TREC-style (keyword, description, narrative)
Relevance: Deeply-annotated
Shared Task Website
Shared Task Paper

Dataset irds.nyt.wksup.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Training set (without held-out nyt/wksup/valid) for transferring relevance signals from NYT corpus.

Dataset irds.nyt.wksup.qrels

Training set (without held-out nyt/wksup/valid) for transferring relevance signals from NYT corpus.

Dataset irds.nyt.wksup

Training set (without held-out nyt/wksup/valid) for transferring relevance signals from NYT corpus.

Dataset irds.nyt.wksup.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Training set (without held-out nyt/wksup/valid) for transferring relevance signals from NYT corpus.

Dataset irds.nyt.wksup.train.qrels

Training set (without held-out nyt/wksup/valid) for transferring relevance signals from NYT corpus.

Dataset irds.nyt.wksup.train

Training set (without held-out nyt/wksup/valid) for transferring relevance signals from NYT corpus.

Dataset irds.nyt.wksup.valid.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Held-out validation set for transferring relevance signals from NYT corpus (see nyt/wksup/train).

Dataset irds.nyt.wksup.valid.qrels

Held-out validation set for transferring relevance signals from NYT corpus (see nyt/wksup/train).

Dataset irds.nyt.wksup.valid

Held-out validation set for transferring relevance signals from NYT corpus (see nyt/wksup/train).
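The identifiers on this page appear to mirror ir_datasets IDs: irds.nyt.wksup.valid corresponds to nyt/wksup/valid, optionally followed by a component such as .queries or .qrels. The sketch below shows that mapping. The helper name split_irds_id and the component list are assumptions made for illustration and are not part of datamaestro_text; it also assumes that no segment of the ID contains a dot.

```python
from typing import Optional, Tuple

# Component suffixes seen on this page (assumed, not necessarily exhaustive).
COMPONENTS = {"documents", "queries", "qrels", "scoreddocs", "docpairs"}


def split_irds_id(datamaestro_id: str) -> Tuple[str, Optional[str]]:
    """Split 'irds.nyt.wksup.valid.qrels' into ('nyt/wksup/valid', 'qrels')."""
    prefix = "irds."
    if not datamaestro_id.startswith(prefix):
        raise ValueError(f"not an irds identifier: {datamaestro_id!r}")
    parts = datamaestro_id[len(prefix):].split(".")
    # A trailing component name is separated from the slash-style dataset ID.
    component = parts.pop() if parts[-1] in COMPONENTS else None
    return "/".join(parts), component
```

For example, split_irds_id("irds.nyt.wksup.valid") yields the slash-style ID with no component, while the .queries and .qrels variants yield the same ID plus the component name.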

pmc/v1

Subset of PMC articles used for the TREC 2014 and 2015 tasks (v1). Includes titles, abstracts, and full text. Collected from the open access segment on January 21, 2014.

Information on documents

Dataset irds.pmc.v1.documents

Subset of PMC articles used for the TREC 2014 and 2015 tasks (v1). Includes titles, abstracts, and full text. Collected from the open access segment on January 21, 2014.

Information on documents

Dataset irds.pmc.v1.trec-cds-2014.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Clinical Decision Support (CDS) track from 2014.

Dataset irds.pmc.v1.trec-cds-2014.qrels

The TREC Clinical Decision Support (CDS) track from 2014.

Dataset irds.pmc.v1.trec-cds-2014

The TREC Clinical Decision Support (CDS) track from 2014.

Dataset irds.pmc.v1.trec-cds-2015.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Clinical Decision Support (CDS) track from 2015.

Dataset irds.pmc.v1.trec-cds-2015.qrels

The TREC Clinical Decision Support (CDS) track from 2015.

Dataset irds.pmc.v1.trec-cds-2015

The TREC Clinical Decision Support (CDS) track from 2015.

pmc/v2

Subset of PMC articles used for the TREC 2016 task (v2). Includes titles, abstracts, and full text. Collected from the open access segment on March 28, 2016.

Information on documents

Dataset irds.pmc.v2.documents

Subset of PMC articles used for the TREC 2016 task (v2). Includes titles, abstracts, and full text. Collected from the open access segment on March 28, 2016.

Information on documents

Dataset irds.pmc.v2.trec-cds-2016.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Clinical Decision Support (CDS) track from 2016.

Dataset irds.pmc.v2.trec-cds-2016.qrels

The TREC Clinical Decision Support (CDS) track from 2016.

Dataset irds.pmc.v2.trec-cds-2016

The TREC Clinical Decision Support (CDS) track from 2016.

Touché Image Search

Corpus version 2022-06-13 with 23 841 images. It was released on June 13, 2022 on Zenodo.

This collection is licensed with the Creative Commons Attribution 4.0 International. Individual rights to the content still apply.

Dataset irds.touche-image.2022-06-13.documents

Corpus version 2022-06-13 with 23 841 images. It was released on June 13, 2022 on Zenodo.

This collection is licensed with the Creative Commons Attribution 4.0 International. Individual rights to the content still apply.

Dataset irds.touche-image.2022-06-13.touche-2022-task-3.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Decision-making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2022 is the third lab on argument retrieval, held at CLEF 2022 and featuring three tasks.

Given a controversial topic, the task is to retrieve images (from touche-image/2022-06-13) for each stance (pro/con) that show support for that stance.

Systems are evaluated on Touché topics 1-50 by the ratio of images among the 20 retrieved images for each topic (10 images for each stance) that are all three: relevant to the topic, argumentative, and have the associated stance.
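The evaluation ratio described above can be sketched as a per-topic precision over both stances. This is an illustrative reading of the description, not the official Touché scorer: the function name stance_precision, the "PRO"/"CON" labels, and the judgment tuple layout are assumptions, and unjudged images are counted as misses.

```python
def stance_precision(retrieved, judgments, k=10):
    """Ratio of retrieved images that are on-topic, argumentative, and of the right stance.

    retrieved: {"PRO": [image_id, ...], "CON": [image_id, ...]} for one topic.
    judgments: {image_id: (on_topic, argumentative, stance)}.
    """
    hits = 0
    for stance, image_ids in retrieved.items():
        for image_id in image_ids[:k]:
            on_topic, argumentative, judged = judgments.get(image_id, (False, False, None))
            # An image counts only if all three conditions hold.
            if on_topic and argumentative and judged == stance:
                hits += 1
    # Denominator is k images per stance (20 for k=10), even if fewer were returned.
    return hits / (k * len(retrieved)) if retrieved else 0.0


# Toy topic with k=2 images per stance (hypothetical IDs and judgments).
retrieved = {"PRO": ["a", "b"], "CON": ["c", "d"]}
judgments = {
    "a": (True, True, "PRO"),   # counts
    "b": (True, False, "PRO"),  # not argumentative
    "c": (True, True, "PRO"),   # wrong stance for the CON list
    "d": (True, True, "CON"),   # counts
}
score = stance_precision(retrieved, judgments, k=2)
```

Averaging this value over topics 1-50 would give a single system-level score under the same assumptions.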

Dataset irds.touche-image.2022-06-13.touche-2022-task-3.qrels

Decision-making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2022 is the third lab on argument retrieval, held at CLEF 2022 and featuring three tasks.

Given a controversial topic, the task is to retrieve images (from touche-image/2022-06-13) for each stance (pro/con) that show support for that stance.

Systems are evaluated on Touché topics 1-50 by the ratio of images among the 20 retrieved images for each topic (10 images for each stance) that are all three: relevant to the topic, argumentative, and have the associated stance.

Dataset irds.touche-image.2022-06-13.touche-2022-task-3

Decision-making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2022 is the third lab on argument retrieval, held at CLEF 2022 and featuring three tasks.

Given a controversial topic, the task is to retrieve images (from touche-image/2022-06-13) for each stance (pro/con) that show support for that stance.

Systems are evaluated on Touché topics 1-50 by the ratio of images among the 20 retrieved images for each topic (10 images for each stance) that are all three: relevant to the topic, argumentative, and have the associated stance.

Touché 2022 Task 2: Argument Retrieval for Comparative Questions

Decision-making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2022 is the third lab on argument retrieval, held at CLEF 2022 and featuring three tasks.

Given a comparative topic and a collection of documents, the task is to retrieve relevant argumentative passages for either compared object or for both and to detect their respective stances with respect to the object they talk about.

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

Additionally, classify the stance of the retrieved text passages towards the compared objects in questions. For instance, in the question "Who is a better friend, a cat or a dog?" the terms "cat" and "dog" are the comparison objects. An answer candidate like "Cats can be quite affectionate and attentive, and thus are good friends" should be classified as pro the "cat" object, while "Cats are less faithful than dogs" should be classified as supporting the "dog" object.

Dataset irds.clueweb12.touche-2022-task-2.documents

Decision-making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2022 is the third lab on argument retrieval, held at CLEF 2022 and featuring three tasks.

Given a comparative topic and a collection of documents, the task is to retrieve relevant argumentative passages for either compared object or for both and to detect their respective stances with respect to the object they talk about.

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

Additionally, classify the stance of the retrieved text passages towards the compared objects in questions. For instance, in the question "Who is a better friend, a cat or a dog?" the terms "cat" and "dog" are the comparison objects. An answer candidate like "Cats can be quite affectionate and attentive, and thus are good friends" should be classified as pro the "cat" object, while "Cats are less faithful than dogs" should be classified as supporting the "dog" object.

Dataset irds.clueweb12.touche-2022-task-2.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Decision-making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2022 is the third lab on argument retrieval, held at CLEF 2022 and featuring three tasks.

Given a comparative topic and a collection of documents, the task is to retrieve relevant argumentative passages for either compared object or for both and to detect their respective stances with respect to the object they talk about.

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

Additionally, classify the stance of the retrieved text passages towards the compared objects in questions. For instance, in the question "Who is a better friend, a cat or a dog?" the terms "cat" and "dog" are the comparison objects. An answer candidate like "Cats can be quite affectionate and attentive, and thus are good friends" should be classified as pro the "cat" object, while "Cats are less faithful than dogs" should be classified as supporting the "dog" object.

Dataset irds.clueweb12.touche-2022-task-2.qrels

Decision-making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2022 is the third lab on argument retrieval, held at CLEF 2022 and featuring three tasks.

Given a comparative topic and a collection of documents, the task is to retrieve relevant argumentative passages for either compared object or for both and to detect their respective stances with respect to the object they talk about.

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

Additionally, classify the stance of the retrieved text passages towards the compared objects in questions. For instance, in the question "Who is a better friend, a cat or a dog?" the terms "cat" and "dog" are the comparison objects. An answer candidate like "Cats can be quite affectionate and attentive, and thus are good friends" should be classified as pro the "cat" object, while "Cats are less faithful than dogs" should be classified as supporting the "dog" object.

Dataset irds.clueweb12.touche-2022-task-2

Decision-making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2022 is the third lab on argument retrieval, held at CLEF 2022 and featuring three tasks.

Given a comparative topic and a collection of documents, the task is to retrieve relevant argumentative passages for either compared object or for both and to detect their respective stances with respect to the object they talk about.

Documents are judged based on their general topical relevance and for rhetorical quality, i.e., "well-writtenness" of the document: (1) whether the text has a good style of speech (formal language is preferred over informal), (2) whether the text has a proper sentence structure and is easy to read, (3) whether it includes profanity, has typos, and makes use of other detrimental style choices.

Additionally, classify the stance of the retrieved text passages towards the compared objects in questions. For instance, in the question "Who is a better friend, a cat or a dog?" the terms "cat" and "dog" are the comparison objects. An answer candidate like "Cats can be quite affectionate and attentive, and thus are good friends" should be classified as pro the "cat" object, while "Cats are less faithful than dogs" should be classified as supporting the "dog" object.

Touché 2022 Task 2: Argument Retrieval for Comparative Questions (Expanded)

Pre-processed version of clueweb12/touche-2022-task-2 where each passage has been expanded with queries generated using DocT5Query.
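Document expansion of this kind appends generated queries to the passage text before indexing. The sketch below shows only the appending step and takes the generated queries as input; in practice they would come from a DocT5Query model, which is not run here. The function name expand_passage and the deduplication step are assumptions for illustration, not a description of how this dataset was built.

```python
def expand_passage(text, generated_queries):
    """Append generated queries to a passage, dropping blanks and exact duplicates."""
    cleaned = (query.strip() for query in generated_queries)
    # dict.fromkeys keeps the first occurrence of each query and preserves order.
    unique = list(dict.fromkeys(query for query in cleaned if query))
    return " ".join([text, *unique])


# Toy example (hypothetical passage and queries).
expanded = expand_passage(
    "Cats can be quite affectionate and attentive.",
    ["are cats good pets", "are cats good pets", "do cats like people", " "],
)
```

The expanded text contains extra query-like terms, which lets a lexical retriever match questions whose wording differs from the original passage.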

Dataset irds.clueweb12.touche-2022-task-2.expanded-doc-t5-query.documents

Pre-processed version of clueweb12/touche-2022-task-2 where each passage has been expanded with queries generated using DocT5Query.

Dataset irds.clueweb12.touche-2022-task-2.expanded-doc-t5-query.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Pre-processed version of clueweb12/touche-2022-task-2 where each passage has been expanded with queries generated using DocT5Query.

Dataset irds.clueweb12.touche-2022-task-2.expanded-doc-t5-query.qrels

Pre-processed version of clueweb12/touche-2022-task-2 where each passage has been expanded with queries generated using DocT5Query.

Dataset irds.clueweb12.touche-2022-task-2.expanded-doc-t5-query

Pre-processed version of clueweb12/touche-2022-task-2 where each passage has been expanded with queries generated using DocT5Query.

TREC Arabic

A collection of news articles in Arabic, used for multi-lingual evaluation in TREC 2001 and TREC 2002.

Document collection from LDC2001T55.

Dataset irds.trec-arabic.documents

A collection of news articles in Arabic, used for multi-lingual evaluation in TREC 2001 and TREC 2002.

Document collection from LDC2001T55.

Dataset irds.trec-arabic.ar2001.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Arabic benchmark from TREC 2001.

Task Overview Paper

Dataset irds.trec-arabic.ar2001.qrels

Arabic benchmark from TREC 2001.

Task Overview Paper

Dataset irds.trec-arabic.ar2001

Arabic benchmark from TREC 2001.

Task Overview Paper

Dataset irds.trec-arabic.ar2002.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Arabic benchmark from TREC 2002.

Task Overview Paper

Dataset irds.trec-arabic.ar2002.qrels

Arabic benchmark from TREC 2002.

Task Overview Paper

Dataset irds.trec-arabic.ar2002

Arabic benchmark from TREC 2002.

Task Overview Paper

TREC Mandarin

A collection of news articles in Mandarin in Simplified Chinese, used for multi-lingual evaluation in TREC 5 and TREC 6.

Document collection from LDC2000T52.

Dataset irds.trec-mandarin.documents

A collection of news articles in Mandarin in Simplified Chinese, used for multi-lingual evaluation in TREC 5 and TREC 6.

Document collection from LDC2000T52.

Dataset irds.trec-mandarin.trec5.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Mandarin Chinese benchmark from TREC 5.

Task Overview Paper

Dataset irds.trec-mandarin.trec5.qrels

Mandarin Chinese benchmark from TREC 5.

Task Overview Paper

Dataset irds.trec-mandarin.trec5

Mandarin Chinese benchmark from TREC 5.

Task Overview Paper

Dataset irds.trec-mandarin.trec6.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Mandarin Chinese benchmark from TREC 6.

Task Overview Paper

Dataset irds.trec-mandarin.trec6.qrels

Mandarin Chinese benchmark from TREC 6.

Task Overview Paper

Dataset irds.trec-mandarin.trec6

Mandarin Chinese benchmark from TREC 6.

Task Overview Paper

TREC Spanish

A collection of news articles in Spanish, used for multi-lingual evaluation in TREC 3 and TREC 4.

Document collection from LDC2000T51.

Dataset irds.trec-spanish.documents

A collection of news articles in Spanish, used for multi-lingual evaluation in TREC 3 and TREC 4.

Document collection from LDC2000T51.

Dataset irds.trec-spanish.trec3.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Spanish benchmark from TREC 3.

Task Overview Paper

Dataset irds.trec-spanish.trec3.qrels

Spanish benchmark from TREC 3.

Task Overview Paper

Dataset irds.trec-spanish.trec3

Spanish benchmark from TREC 3.

Task Overview Paper

Dataset irds.trec-spanish.trec4.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Spanish benchmark from TREC 4.

Task Overview Paper

Dataset irds.trec-spanish.trec4.qrels

Spanish benchmark from TREC 4.

Task Overview Paper

Dataset irds.trec-spanish.trec4

Spanish benchmark from TREC 4.

Task Overview Paper

trec-tot/2023

Corpus for the TREC 2023 tip-of-the-tongue search track.

Dataset irds.trec-tot.2023.documents

Corpus for the TREC 2023 tip-of-the-tongue search track.

Dataset irds.trec-tot.2023.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train query set for TREC 2023 tip-of-the-tongue search track.

Dataset irds.trec-tot.2023.train.qrels

Train query set for TREC 2023 tip-of-the-tongue search track.

Dataset irds.trec-tot.2023.train

Train query set for TREC 2023 tip-of-the-tongue search track.

Dataset irds.trec-tot.2023.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Dev query set for TREC 2023 tip-of-the-tongue search track.

Dataset irds.trec-tot.2023.dev.qrels

Dev query set for TREC 2023 tip-of-the-tongue search track.

Dataset irds.trec-tot.2023.dev

Dev query set for TREC 2023 tip-of-the-tongue search track.

trec-tot/2024

Corpus for the TREC 2024 tip-of-the-tongue search track.

Dataset irds.trec-tot.2024.documents

Corpus for the TREC 2024 tip-of-the-tongue search track.

Dataset irds.trec-tot.2024.test.queries

Test query set for TREC 2024 tip-of-the-tongue search track.

TripClick

TripClick is a large collection from the Trip Database. Relevance is inferred from click signals.

A copy of this dataset can be obtained from the Trip Database through the process described here. Documents, queries, and qrels require the "TripClick IR Benchmark"; for scoreddocs and docpairs, you will also need to request the "TripClick Training Package for Deep Learning Models".

Documents: Medline article titles and abstracts
Queries: user queries issued to the Trip Database
Qrels: Inferred from clicks
Dataset request form
Dataset website
Dataset paper

Dataset irds.tripclick.documents

TripClick is a large collection from the Trip Database. Relevance is inferred from click signals.

A copy of this dataset can be obtained from the Trip Database through the process described here. Documents, queries, and qrels require the "TripClick IR Benchmark"; for scoreddocs and docpairs, you will also need to request the "TripClick Training Package for Deep Learning Models".

Documents: Medline article titles and abstracts
Queries: user queries issued to the Trip Database
Qrels: Inferred from clicks
Dataset request form
Dataset website
Dataset paper

Dataset irds.tripclick.test.queries

Test subset of tripclick, including all queries from tripclick/test/head, tripclick/test/torso, and tripclick/test/tail.

The scoreddocs are the official BM25 results from Anserini.

Dataset irds.tripclick.test.scoreddocs

Test subset of tripclick, including all queries from tripclick/test/head, tripclick/test/torso, and tripclick/test/tail.

The scoreddocs are the official BM25 results from Anserini.

Dataset irds.tripclick.test.head.queries

The most frequent queries in the test set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.test.head.scoreddocs

The most frequent queries in the test set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.test.tail.queries

The least frequent queries in the test set. This represents 50% of the search engine traffic.

Dataset irds.tripclick.test.tail.scoreddocs

The least frequent queries in the test set. This represents 50% of the search engine traffic.

Dataset irds.tripclick.test.torso.queries

The moderately frequent queries in the test set. This represents 30% of the search engine traffic.

Dataset irds.tripclick.test.torso.scoreddocs

The moderately frequent queries in the test set. This represents 30% of the search engine traffic.
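The traffic percentages quoted for the head, torso, and tail subsets can be read as each group's share of all query occurrences. The sketch below computes such shares from a query frequency table. The counts, group assignments, and the function name traffic_share are hypothetical; this does not reproduce how TripClick actually assigned queries to groups.

```python
def traffic_share(groups, query_counts):
    """Fraction of all query occurrences that falls into each named group."""
    total = sum(query_counts.values())
    return {
        name: sum(query_counts[query] for query in queries) / total
        for name, queries in groups.items()
    }


# Toy log: one very frequent query, three moderate ones, ten rare ones.
query_counts = {"head-0": 20}
query_counts.update({f"torso-{i}": 10 for i in range(3)})
query_counts.update({f"tail-{i}": 5 for i in range(10)})

groups = {
    "head": [q for q in query_counts if q.startswith("head")],
    "torso": [q for q in query_counts if q.startswith("torso")],
    "tail": [q for q in query_counts if q.startswith("tail")],
}
shares = traffic_share(groups, query_counts)
```

In this toy log a single head query accounts for as much traffic as four tail queries, which is the kind of skew the three subsets are meant to separate.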

Dataset irds.tripclick.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Training subset of tripclick, including all queries from tripclick/train/head, tripclick/train/torso, and tripclick/train/tail.

The dataset provides docpairs in a full text format; we map this text back to the query and doc IDs. A small number of docpairs could not be mapped back, so they are skipped.
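Mapping full-text docpairs back to IDs can be sketched with exact-match lookups. This is an assumption-laden illustration rather than the ir_datasets code: the function name map_docpairs is hypothetical, texts are assumed to be unique, and any triple with an unknown text is skipped.

```python
def map_docpairs(text_triples, queries, docs):
    """Convert (query text, positive text, negative text) triples to ID triples.

    queries: {query_id: text}, docs: {doc_id: text}.
    Returns the mapped triples and the number of triples that were skipped.
    """
    # Reverse lookups from text to ID (assumes each text appears only once).
    query_ids = {text: query_id for query_id, text in queries.items()}
    doc_ids = {text: doc_id for doc_id, text in docs.items()}
    mapped, skipped = [], 0
    for query_text, positive_text, negative_text in text_triples:
        try:
            mapped.append(
                (query_ids[query_text], doc_ids[positive_text], doc_ids[negative_text])
            )
        except KeyError:
            # Any text that cannot be matched causes the whole triple to be dropped.
            skipped += 1
    return mapped, skipped


# Toy data (hypothetical): the second triple has an unknown negative text.
queries = {"q1": "chest pain"}
docs = {"d1": "Angina overview", "d2": "Knee surgery"}
triples = [
    ("chest pain", "Angina overview", "Knee surgery"),
    ("chest pain", "Angina overview", "Text not in the collection"),
]
mapped, skipped = map_docpairs(triples, queries, docs)
```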

Dataset irds.tripclick.train.docpairs

Training subset of tripclick, including all queries from tripclick/train/head, tripclick/train/torso, and tripclick/train/tail.

The dataset provides docpairs in a full text format; we map this text back to the query and doc IDs. A small number of docpairs could not be mapped back, so they are skipped.

Dataset irds.tripclick.train.qrels

Training subset of tripclick, including all queries from tripclick/train/head, tripclick/train/torso, and tripclick/train/tail.

The dataset provides docpairs in a full text format; we map this text back to the query and doc IDs. A small number of docpairs could not be mapped back, so they are skipped.

Dataset irds.tripclick.train

Training subset of tripclick, including all queries from tripclick/train/head, tripclick/train/torso, and tripclick/train/tail.

The dataset provides docpairs in a full text format; we map this text back to the query and doc IDs. A small number of docpairs could not be mapped back, so they are skipped.

Dataset irds.tripclick.train.head.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The most frequent queries in the train set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.train.head.qrels

The most frequent queries in the train set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.train.head

The most frequent queries in the train set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.train.head.dctr.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The most frequent queries in the train set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.train.head.dctr.qrels

The most frequent queries in the train set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.train.head.dctr

The most frequent queries in the train set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.train.hofstaetter-triples.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A version of tripclick/train that replaces the original (noisy) training triples (docpairs) with triples sampled from BM25 results, as suggested by Hofstätter et al. (2022).

Paper
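One way to build triples from BM25 results is to pair each relevant document with a negative drawn from the BM25 ranking for the same query. The sketch below follows that idea only; it is not the exact procedure from the paper. The function name sample_triples, the uniform sampling, and the input layouts are assumptions.

```python
import random


def sample_triples(qrels, bm25_results, negatives_per_positive=1, seed=0):
    """Build (query_id, positive_id, negative_id) triples with BM25 negatives.

    qrels: {query_id: set of relevant doc IDs}.
    bm25_results: {query_id: ranked list of doc IDs}.
    """
    rng = random.Random(seed)  # Seeded so that sampling is reproducible.
    triples = []
    for query_id, positives in qrels.items():
        # Negatives are BM25 hits that are not judged relevant for this query.
        candidates = [d for d in bm25_results.get(query_id, []) if d not in positives]
        if not candidates:
            continue
        for positive in sorted(positives):
            for _ in range(negatives_per_positive):
                triples.append((query_id, positive, rng.choice(candidates)))
    return triples


# Toy data (hypothetical): q2 has no usable negative and yields no triple.
qrels = {"q1": {"d1"}, "q2": {"d5"}}
bm25_results = {"q1": ["d1", "d2", "d3"], "q2": ["d5"]}
triples = sample_triples(qrels, bm25_results)
```

Drawing negatives from the BM25 ranking tends to produce harder examples than random documents, since they already match the query lexically.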

Dataset irds.tripclick.train.hofstaetter-triples.docpairs

A version of tripclick/train that replaces the original (noisy) training triples (docpairs) with triples sampled from BM25 results, as suggested by Hofstätter et al. (2022).

Paper

Dataset irds.tripclick.train.hofstaetter-triples.qrels

A version of tripclick/train that replaces the original (noisy) training triples (docpairs) with triples sampled from BM25 results, as suggested by Hofstätter et al. (2022).

Paper

Dataset irds.tripclick.train.hofstaetter-triples

A version of tripclick/train that replaces the original (noisy) training triples (docpairs) with triples sampled from BM25 results, as suggested by Hofstätter et al. (2022).

Paper

Dataset irds.tripclick.train.tail.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The least frequent queries in the train set. This represents 50% of the search engine traffic.

Dataset irds.tripclick.train.tail.qrels

The least frequent queries in the train set. This represents 50% of the search engine traffic.

Dataset irds.tripclick.train.tail

The least frequent queries in the train set. This represents 50% of the search engine traffic.

Dataset irds.tripclick.train.torso.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The moderately frequent queries in the train set. This represents 30% of the search engine traffic.

Dataset irds.tripclick.train.torso.qrels

The moderately frequent queries in the train set. This represents 30% of the search engine traffic.

Dataset irds.tripclick.train.torso

The moderately frequent queries in the train set. This represents 30% of the search engine traffic.
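
The head, torso, and tail subsets group queries by frequency so that each group accounts for a fixed share of traffic (20%, 30%, and 50%). A minimal sketch of one way such a partition could be computed from a query log follows. The function name `split_by_traffic`, the cumulative-share rule, and the sample log are illustrative assumptions, not the procedure used to build TripClick.

```python
from collections import Counter
from typing import Dict, Iterable, List


def split_by_traffic(
    log: Iterable[str], head_pct: int = 20, torso_pct: int = 30
) -> Dict[str, List[str]]:
    """Bucket distinct queries into head/torso/tail by cumulative traffic share."""
    counts = Counter(log)
    total = sum(counts.values())
    buckets: Dict[str, List[str]] = {"head": [], "torso": [], "tail": []}
    seen = 0  # number of log events covered by queries already assigned
    # Visit queries from most to least frequent (ties broken alphabetically).
    for query, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        # Integer arithmetic avoids floating-point edge cases at the boundaries.
        if seen * 100 < head_pct * total:
            buckets["head"].append(query)
        elif seen * 100 < (head_pct + torso_pct) * total:
            buckets["torso"].append(query)
        else:
            buckets["tail"].append(query)
        seen += count
    return buckets


# Hypothetical log: 10 query events over 8 distinct queries.
log = ["a"] * 2 + ["b"] * 2 + ["c", "d", "e", "f", "g", "h"]
buckets = split_by_traffic(log)
```

With this rule a few frequent queries end up in the head while the tail holds many rare ones, which is the usual shape of search traffic.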

Dataset irds.tripclick.val.queries

Validation subset of tripclick, including all queries from tripclick/val/head, tripclick/val/torso, and tripclick/val/tail.

The scoreddocs are the official BM25 results from Anserini.

Dataset irds.tripclick.val.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Validation subset of tripclick, including all queries from tripclick/val/head, tripclick/val/torso, and tripclick/val/tail.

The scoreddocs are the official BM25 results from Anserini.

Dataset irds.tripclick.val.qrels

Validation subset of tripclick, including all queries from tripclick/val/head, tripclick/val/torso, and tripclick/val/tail.

The scoreddocs are the official BM25 results from Anserini.

Dataset irds.tripclick.val

Validation subset of tripclick, including all queries from tripclick/val/head, tripclick/val/torso, and tripclick/val/tail.

The scoreddocs are the official BM25 results from Anserini.

Dataset irds.tripclick.val.head.queries

The most frequent queries in the validation set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.val.head.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The most frequent queries in the validation set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.val.head.qrels

The most frequent queries in the validation set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.val.head

The most frequent queries in the validation set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.val.head.dctr.queries

The most frequent queries in the validation set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.val.head.dctr.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The most frequent queries in the validation set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.val.head.dctr.qrels

The most frequent queries in the validation set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.val.head.dctr

The most frequent queries in the validation set. This represents 20% of the search engine traffic.

Dataset irds.tripclick.val.tail.queries

The least frequent queries in the validation set. This represents 50% of the search engine traffic.

Dataset irds.tripclick.val.tail.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The least frequent queries in the validation set. This represents 50% of the search engine traffic.

Dataset irds.tripclick.val.tail.qrels

The least frequent queries in the validation set. This represents 50% of the search engine traffic.

Dataset irds.tripclick.val.tail

The least frequent queries in the validation set. This represents 50% of the search engine traffic.

Dataset irds.tripclick.val.torso.queries

The moderately frequent queries in the validation set. This represents 30% of the search engine traffic.

Dataset irds.tripclick.val.torso.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The moderately frequent queries in the validation set. This represents 30% of the search engine traffic.

Dataset irds.tripclick.val.torso.qrels

The moderately frequent queries in the validation set. This represents 30% of the search engine traffic.

Dataset irds.tripclick.val.torso

The moderately frequent queries in the validation set. This represents 30% of the search engine traffic.

tripclick/logs

Raw query logs from TripClick.

Note that this subset includes a broader set of documents than the main collection, but they only provide the title and URL.

Dataset irds.tripclick.logs.documents

Raw query logs from TripClick.

Note that this subset includes a broader set of documents than the main collection, but they only provide the title and URL.

Tweets 2013 (Internet Archive)

A collection of tweets from a 2-month window archived by the Internet Archive. This collection can be a stand-in document collection for the TREC Microblog 2013-14 tasks. (Even though it is not exactly the same collection, Sequiera and Lin show that it is close enough.)

This collection is automatically downloaded from the Internet Archive, though download speeds are often slow so it takes some time. ir_datasets constructs a new directory hierarchy during the download process to facilitate fast lookups and slices.

Dataset irds.tweets2013-ia.documents

A collection of tweets from a 2-month window archived by the Internet Archive. This collection can be a stand-in document collection for the TREC Microblog 2013-14 tasks. (Even though it is not exactly the same collection, Sequiera and Lin show that it is close enough.)

This collection is automatically downloaded from the Internet Archive, though download speeds are often slow so it takes some time. ir_datasets constructs a new directory hierarchy during the download process to facilitate fast lookups and slices.

Dataset irds.tweets2013-ia.trec-mb-2013.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

TREC Microblog 2013 test collection.

Dataset irds.tweets2013-ia.trec-mb-2013.qrels

TREC Microblog 2013 test collection.

Dataset irds.tweets2013-ia.trec-mb-2013

TREC Microblog 2013 test collection.

Dataset irds.tweets2013-ia.trec-mb-2014.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

TREC Microblog 2014 test collection.

Dataset irds.tweets2013-ia.trec-mb-2014.qrels

TREC Microblog 2014 test collection.

Dataset irds.tweets2013-ia.trec-mb-2014

TREC Microblog 2014 test collection.

Vaswani

A small corpus of roughly 11,000 scientific abstracts.

Documents: Scientific abstracts
Queries: Natural language keywords
Dataset Information

Dataset irds.vaswani.documents

A small corpus of roughly 11,000 scientific abstracts.

Documents: Scientific abstracts
Queries: Natural language keywords
Dataset Information

Dataset irds.vaswani.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A small corpus of roughly 11,000 scientific abstracts.

Documents: Scientific abstracts
Queries: Natural language keywords
Dataset Information

Dataset irds.vaswani.qrels

A small corpus of roughly 11,000 scientific abstracts.

Documents: Scientific abstracts
Queries: Natural language keywords
Dataset Information

Dataset irds.vaswani

A small corpus of roughly 11,000 scientific abstracts.

Documents: Scientific abstracts
Queries: Natural language keywords
Dataset Information

wapo/v2

Version 2 of the Washington Post collection, consisting of articles published between 2012 and 2017.

The collection can be obtained by requesting it from NIST here.

body contains all body text in plain text format, including paragraphs and multi-media captions. body_paras_html contains only source paragraphs and contains HTML markup. body_media contains images, videos, tweets, and galleries, along with a link to the content and a textual caption.

Collection Website
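
Because body_paras_html keeps HTML markup while body is plain text, a consumer working from the paragraph field may need to strip the tags itself. The sketch below uses only the standard library. It assumes body_paras_html is a sequence of HTML strings, one per paragraph; the helper name `paras_to_text` and the sample paragraphs are made up for illustration.

```python
from html.parser import HTMLParser
from typing import Iterable, List


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def paras_to_text(paras_html: Iterable[str]) -> List[str]:
    """Return one plain-text string per HTML paragraph."""
    texts = []
    for html in paras_html:
        parser = _TextExtractor()
        parser.feed(html)
        parser.close()
        # Collapse whitespace runs left behind by removed tags.
        texts.append(" ".join("".join(parser.parts).split()))
    return texts


# Invented sample paragraphs, not taken from the collection.
sample = ['<p>Rain is <a href="#">expected</a> on Friday.</p>', "<p>Fish &amp; chips</p>"]
plain = paras_to_text(sample)
```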

Dataset irds.wapo.v2.documents

Version 2 of the Washington Post collection, consisting of articles published between 2012 and 2017.

The collection can be obtained by requesting it from NIST here.

body contains all body text in plain text format, including paragraphs and multi-media captions. body_paras_html contains only source paragraphs and contains HTML markup. body_media contains images, videos, tweets, and galleries, along with a link to the content and a textual caption.

Collection Website

Dataset irds.wapo.v2.trec-core-2018.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC Common Core 2018 benchmark.

Queries: TREC-style (keyword, description, narrative)
Relevance: Deeply-annotated
Shared Task Website

Dataset irds.wapo.v2.trec-core-2018.qrels

The TREC Common Core 2018 benchmark.

Queries: TREC-style (keyword, description, narrative)
Relevance: Deeply-annotated
Shared Task Website

Dataset irds.wapo.v2.trec-core-2018

The TREC Common Core 2018 benchmark.

Queries: TREC-style (keyword, description, narrative)
Relevance: Deeply-annotated
Shared Task Website

Dataset irds.wapo.v2.trec-news-2018.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC News 2018 Background Linking task. The task is to find relevant background information for the provided articles.

Queries: Articles via the doc_id field
Shared Task Website
Shared task paper

Dataset irds.wapo.v2.trec-news-2018.qrels

The TREC News 2018 Background Linking task. The task is to find relevant background information for the provided articles.

Queries: Articles via the doc_id field
Shared Task Website
Shared task paper

Dataset irds.wapo.v2.trec-news-2018

The TREC News 2018 Background Linking task. The task is to find relevant background information for the provided articles.

Queries: Articles via the doc_id field
Shared Task Website
Shared task paper

Dataset irds.wapo.v2.trec-news-2019.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

The TREC News 2019 Background Linking task. The task is to find relevant background information for the provided articles.

Queries: Articles via the doc_id field
Shared Task Website
Shared task paper

Dataset irds.wapo.v2.trec-news-2019.qrels

The TREC News 2019 Background Linking task. The task is to find relevant background information for the provided articles.

Queries: Articles via the doc_id field
Shared Task Website
Shared task paper

Dataset irds.wapo.v2.trec-news-2019

The TREC News 2019 Background Linking task. The task is to find relevant background information for the provided articles.

Queries: Articles via the doc_id field
Shared Task Website
Shared task paper

wapo/v4

Dataset irds.wapo.v4.documents: → datamaestro_text.datasets.irds.data.Documents

wikiclir/ar

WikiCLIR with Arabic documents.

Dataset irds.wikiclir.ar.documents

WikiCLIR with Arabic documents.

Dataset irds.wikiclir.ar.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Arabic documents.

Dataset irds.wikiclir.ar.qrels

WikiCLIR with Arabic documents.

Dataset irds.wikiclir.ar

WikiCLIR with Arabic documents.
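
The identifiers on this page follow a visible pattern: the ir_datasets name wikiclir/ar appears as irds.wikiclir.ar, optionally followed by a component suffix such as .documents, .queries, or .qrels. A small sketch of that naming convention, inferred from the listing rather than taken from the datamaestro_text source (the function name `to_datamaestro_id` is hypothetical):

```python
from typing import Optional


def to_datamaestro_id(irds_name: str, component: Optional[str] = None) -> str:
    """Map an ir_datasets name such as 'wikiclir/ar' to the 'irds.*' form used here."""
    # Split on '/' and drop empty segments from leading or trailing slashes.
    parts = ["irds"] + [p for p in irds_name.strip("/").split("/") if p]
    if component is not None:
        parts.append(component)
    return ".".join(parts)


# Build the qrels identifiers for a few WikiCLIR languages.
ids = [to_datamaestro_id(f"wikiclir/{lang}", "qrels") for lang in ("ar", "ca", "cs")]
```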

wikiclir/ca

WikiCLIR with Catalan documents.

Dataset irds.wikiclir.ca.documents

WikiCLIR with Catalan documents.

Dataset irds.wikiclir.ca.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Catalan documents.

Dataset irds.wikiclir.ca.qrels

WikiCLIR with Catalan documents.

Dataset irds.wikiclir.ca

WikiCLIR with Catalan documents.

wikiclir/cs

WikiCLIR with Czech documents.

Dataset irds.wikiclir.cs.documents

WikiCLIR with Czech documents.

Dataset irds.wikiclir.cs.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Czech documents.

Dataset irds.wikiclir.cs.qrels

WikiCLIR with Czech documents.

Dataset irds.wikiclir.cs

WikiCLIR with Czech documents.

wikiclir/de

WikiCLIR with German documents.

Dataset irds.wikiclir.de.documents

WikiCLIR with German documents.

Dataset irds.wikiclir.de.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with German documents.

Dataset irds.wikiclir.de.qrels

WikiCLIR with German documents.

Dataset irds.wikiclir.de

WikiCLIR with German documents.

wikiclir/en-simple

WikiCLIR with Simple English documents.

Dataset irds.wikiclir.en-simple.documents

WikiCLIR with Simple English documents.

Dataset irds.wikiclir.en-simple.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Simple English documents.

Dataset irds.wikiclir.en-simple.qrels

WikiCLIR with Simple English documents.

Dataset irds.wikiclir.en-simple

WikiCLIR with Simple English documents.

wikiclir/es

WikiCLIR with Spanish documents.

Dataset irds.wikiclir.es.documents

WikiCLIR with Spanish documents.

Dataset irds.wikiclir.es.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Spanish documents.

Dataset irds.wikiclir.es.qrels

WikiCLIR with Spanish documents.

Dataset irds.wikiclir.es

WikiCLIR with Spanish documents.

wikiclir/fi

WikiCLIR with Finnish documents.

Dataset irds.wikiclir.fi.documents

WikiCLIR with Finnish documents.

Dataset irds.wikiclir.fi.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Finnish documents.

Dataset irds.wikiclir.fi.qrels

WikiCLIR with Finnish documents.

Dataset irds.wikiclir.fi

WikiCLIR with Finnish documents.

wikiclir/fr

WikiCLIR with French documents.

Dataset irds.wikiclir.fr.documents

WikiCLIR with French documents.

Dataset irds.wikiclir.fr.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with French documents.

Dataset irds.wikiclir.fr.qrels

WikiCLIR with French documents.

Dataset irds.wikiclir.fr

WikiCLIR with French documents.

wikiclir/it

WikiCLIR with Italian documents.

Dataset irds.wikiclir.it.documents

WikiCLIR with Italian documents.

Dataset irds.wikiclir.it.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Italian documents.

Dataset irds.wikiclir.it.qrels

WikiCLIR with Italian documents.

Dataset irds.wikiclir.it

WikiCLIR with Italian documents.

wikiclir/ja

WikiCLIR with Japanese documents.

Dataset irds.wikiclir.ja.documents

WikiCLIR with Japanese documents.

Dataset irds.wikiclir.ja.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Japanese documents.

Dataset irds.wikiclir.ja.qrels

WikiCLIR with Japanese documents.

Dataset irds.wikiclir.ja

WikiCLIR with Japanese documents.

wikiclir/ko

WikiCLIR with Korean documents.

Dataset irds.wikiclir.ko.documents

WikiCLIR with Korean documents.

Dataset irds.wikiclir.ko.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Korean documents.

Dataset irds.wikiclir.ko.qrels

WikiCLIR with Korean documents.

Dataset irds.wikiclir.ko

WikiCLIR with Korean documents.

wikiclir/nl

WikiCLIR with Dutch documents.

Dataset irds.wikiclir.nl.documents

WikiCLIR with Dutch documents.

Dataset irds.wikiclir.nl.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Dutch documents.

Dataset irds.wikiclir.nl.qrels

WikiCLIR with Dutch documents.

Dataset irds.wikiclir.nl

WikiCLIR with Dutch documents.

wikiclir/nn

WikiCLIR with Norwegian (Nynorsk) documents.

Dataset irds.wikiclir.nn.documents

WikiCLIR with Norwegian (Nynorsk) documents.

Dataset irds.wikiclir.nn.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Norwegian (Nynorsk) documents.

Dataset irds.wikiclir.nn.qrels

WikiCLIR with Norwegian (Nynorsk) documents.

Dataset irds.wikiclir.nn

WikiCLIR with Norwegian (Nynorsk) documents.

wikiclir/no

WikiCLIR with Norwegian (Bokmål) documents.

Dataset irds.wikiclir.no.documents

WikiCLIR with Norwegian (Bokmål) documents.

Dataset irds.wikiclir.no.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Norwegian (Bokmål) documents.

Dataset irds.wikiclir.no.qrels

WikiCLIR with Norwegian (Bokmål) documents.

Dataset irds.wikiclir.no

WikiCLIR with Norwegian (Bokmål) documents.

wikiclir/pl

WikiCLIR with Polish documents.

Dataset irds.wikiclir.pl.documents

WikiCLIR with Polish documents.

Dataset irds.wikiclir.pl.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Polish documents.

Dataset irds.wikiclir.pl.qrels

WikiCLIR with Polish documents.

Dataset irds.wikiclir.pl

WikiCLIR with Polish documents.

wikiclir/pt

WikiCLIR with Portuguese documents.

Dataset irds.wikiclir.pt.documents

WikiCLIR with Portuguese documents.

Dataset irds.wikiclir.pt.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Portuguese documents.

Dataset irds.wikiclir.pt.qrels

WikiCLIR with Portuguese documents.

Dataset irds.wikiclir.pt

WikiCLIR with Portuguese documents.

wikiclir/ro

WikiCLIR with Romanian documents.

Dataset irds.wikiclir.ro.documents

WikiCLIR with Romanian documents.

Dataset irds.wikiclir.ro.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Romanian documents.

Dataset irds.wikiclir.ro.qrels

WikiCLIR with Romanian documents.

Dataset irds.wikiclir.ro

WikiCLIR with Romanian documents.

wikiclir/ru

WikiCLIR with Russian documents.

Dataset irds.wikiclir.ru.documents

WikiCLIR with Russian documents.

Dataset irds.wikiclir.ru.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Russian documents.

Dataset irds.wikiclir.ru.qrels

WikiCLIR with Russian documents.

Dataset irds.wikiclir.ru

WikiCLIR with Russian documents.

wikiclir/sv

WikiCLIR with Swedish documents.

Dataset irds.wikiclir.sv.documents

WikiCLIR with Swedish documents.

Dataset irds.wikiclir.sv.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Swedish documents.

Dataset irds.wikiclir.sv.qrels

WikiCLIR with Swedish documents.

Dataset irds.wikiclir.sv

WikiCLIR with Swedish documents.

wikiclir/sw

WikiCLIR with Swahili documents.

Dataset irds.wikiclir.sw.documents

WikiCLIR with Swahili documents.

Dataset irds.wikiclir.sw.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Swahili documents.

Dataset irds.wikiclir.sw.qrels

WikiCLIR with Swahili documents.

Dataset irds.wikiclir.sw

WikiCLIR with Swahili documents.

wikiclir/tl

WikiCLIR with Tagalog documents.

Dataset irds.wikiclir.tl.documents

WikiCLIR with Tagalog documents.

Dataset irds.wikiclir.tl.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Tagalog documents.

Dataset irds.wikiclir.tl.qrels

WikiCLIR with Tagalog documents.

Dataset irds.wikiclir.tl

WikiCLIR with Tagalog documents.

wikiclir/tr

WikiCLIR with Turkish documents.

Dataset irds.wikiclir.tr.documents

WikiCLIR with Turkish documents.

Dataset irds.wikiclir.tr.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Turkish documents.

Dataset irds.wikiclir.tr.qrels

WikiCLIR with Turkish documents.

Dataset irds.wikiclir.tr

WikiCLIR with Turkish documents.

wikiclir/uk

WikiCLIR with Ukrainian documents.

Dataset irds.wikiclir.uk.documents

WikiCLIR with Ukrainian documents.

Dataset irds.wikiclir.uk.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Ukrainian documents.

Dataset irds.wikiclir.uk.qrels

WikiCLIR with Ukrainian documents.

Dataset irds.wikiclir.uk

WikiCLIR with Ukrainian documents.

wikiclir/vi

WikiCLIR with Vietnamese documents.

Dataset irds.wikiclir.vi.documents

WikiCLIR with Vietnamese documents.

Dataset irds.wikiclir.vi.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Vietnamese documents.

Dataset irds.wikiclir.vi.qrels

WikiCLIR with Vietnamese documents.

Dataset irds.wikiclir.vi

WikiCLIR with Vietnamese documents.

wikiclir/zh

WikiCLIR with Chinese documents.

Dataset irds.wikiclir.zh.documents

WikiCLIR with Chinese documents.

Dataset irds.wikiclir.zh.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

WikiCLIR with Chinese documents.

Dataset irds.wikiclir.zh.qrels

WikiCLIR with Chinese documents.

Dataset irds.wikiclir.zh

WikiCLIR with Chinese documents.

wikir/en1k

A small version of WikIR for English.

Dataset irds.wikir.en1k.documents

A small version of WikIR for English.

Dataset irds.wikir.en1k.test.queries

Test set of wikir/en1k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en1k.test.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set of wikir/en1k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en1k.test.qrels

Test set of wikir/en1k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en1k.test

Test set of wikir/en1k. Scoreddocs are the provided BM25 run.
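
Since each wikir split ships a BM25 run as scoreddocs alongside qrels, a baseline can be scored without running retrieval. The following sketch computes mean reciprocal rank from in-memory records. The `ScoredDoc` and `Qrel` tuples are stand-ins defined here, with fields assumed to follow the (query_id, doc_id, score) and (query_id, doc_id, relevance) shapes, and the sample values are invented.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class ScoredDoc(NamedTuple):
    query_id: str
    doc_id: str
    score: float


class Qrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int


def mean_reciprocal_rank(run: Iterable[ScoredDoc], qrels: Iterable[Qrel]) -> float:
    """MRR over queries that have at least one qrel; higher score ranks first."""
    relevant = defaultdict(set)
    judged = set()
    for q in qrels:
        judged.add(q.query_id)
        if q.relevance > 0:
            relevant[q.query_id].add(q.doc_id)
    by_query = defaultdict(list)
    for sd in run:
        by_query[sd.query_id].append(sd)
    total = 0.0
    for qid in judged:
        ranked = sorted(by_query[qid], key=lambda sd: -sd.score)
        for rank, sd in enumerate(ranked, start=1):
            if sd.doc_id in relevant[qid]:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(judged) if judged else 0.0


# Invented run and judgments for two queries.
run = [
    ScoredDoc("q1", "d1", 12.5), ScoredDoc("q1", "d2", 9.0),
    ScoredDoc("q2", "d3", 7.1), ScoredDoc("q2", "d4", 8.3),
]
qrels = [Qrel("q1", "d2", 1), Qrel("q2", "d4", 2)]
mrr = mean_reciprocal_rank(run, qrels)
```

For real experiments an established evaluation tool is the safer choice; the point here is only how scoreddocs and qrels join on query_id and doc_id.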

Dataset irds.wikir.en1k.training.queries

Training set of wikir/en1k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en1k.training.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Training set of wikir/en1k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en1k.training.qrels

Training set of wikir/en1k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en1k.training

Training set of wikir/en1k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en1k.validation.queries

Validation set of wikir/en1k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en1k.validation.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Validation set of wikir/en1k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en1k.validation.qrels

Validation set of wikir/en1k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en1k.validation

Validation set of wikir/en1k. Scoreddocs are the provided BM25 run.

wikir/en59k

WikIR for English.

Dataset irds.wikir.en59k.documents

WikIR for English.

Dataset irds.wikir.en59k.test.queries

Test set of wikir/en59k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en59k.test.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set of wikir/en59k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en59k.test.qrels

Test set of wikir/en59k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en59k.test

Test set of wikir/en59k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en59k.training.queries

Training set of wikir/en59k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en59k.training.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Training set of wikir/en59k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en59k.training.qrels

Training set of wikir/en59k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en59k.training

Training set of wikir/en59k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en59k.validation.queries

Validation set of wikir/en59k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en59k.validation.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Validation set of wikir/en59k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en59k.validation.qrels

Validation set of wikir/en59k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en59k.validation

Validation set of wikir/en59k. Scoreddocs are the provided BM25 run.

wikir/en78k

WikIR for English. This is one of the two versions used in Frej2020Wikir.

Dataset irds.wikir.en78k.documents

WikIR for English. This is one of the two versions used in Frej2020Wikir.

Dataset irds.wikir.en78k.test.queries

Test set of wikir/en78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en78k.test.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set of wikir/en78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en78k.test.qrels

Test set of wikir/en78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en78k.test

Test set of wikir/en78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en78k.training.queries

Training set of wikir/en78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en78k.training.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Training set of wikir/en78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en78k.training.qrels

Training set of wikir/en78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en78k.training

Training set of wikir/en78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en78k.validation.queries

Validation set of wikir/en78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en78k.validation.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Validation set of wikir/en78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en78k.validation.qrels

Validation set of wikir/en78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.en78k.validation

Validation set of wikir/en78k. Scoreddocs are the provided BM25 run.

wikir/ens78k

WikIR for English, using the first sentences of articles as queries. This is one of the two versions used in Frej2020Wikir.

Dataset irds.wikir.ens78k.documents

WikIR for English, using the first sentences of articles as queries. This is one of the two versions used in Frej2020Wikir.

Dataset irds.wikir.ens78k.test.queries

Test set of wikir/ens78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.ens78k.test.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set of wikir/ens78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.ens78k.test.qrels

Test set of wikir/ens78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.ens78k.test

Test set of wikir/ens78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.ens78k.training.queries

Training set of wikir/ens78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.ens78k.training.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Training set of wikir/ens78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.ens78k.training.qrels

Training set of wikir/ens78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.ens78k.training

Training set of wikir/ens78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.ens78k.validation.queries

Validation set of wikir/ens78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.ens78k.validation.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Validation set of wikir/ens78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.ens78k.validation.qrels

Validation set of wikir/ens78k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.ens78k.validation

Validation set of wikir/ens78k. Scoreddocs are the provided BM25 run.

wikir/es13k

WikIR for Spanish.

Dataset irds.wikir.es13k.documents

WikIR for Spanish.

Dataset irds.wikir.es13k.test.queries

Test set of wikir/es13k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.es13k.test.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set of wikir/es13k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.es13k.test.qrels

Test set of wikir/es13k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.es13k.test

Test set of wikir/es13k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.es13k.training.queries

Training set of wikir/es13k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.es13k.training.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Training set of wikir/es13k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.es13k.training.qrels

Training set of wikir/es13k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.es13k.training

Training set of wikir/es13k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.es13k.validation.queries

Validation set of wikir/es13k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.es13k.validation.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Validation set of wikir/es13k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.es13k.validation.qrels

Validation set of wikir/es13k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.es13k.validation

Validation set of wikir/es13k. Scoreddocs are the provided BM25 run.

wikir/fr14k

WikIR for French.

Dataset irds.wikir.fr14k.documents

WikIR for French.

Dataset irds.wikir.fr14k.test.queries

Test set of wikir/fr14k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.fr14k.test.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set of wikir/fr14k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.fr14k.test.qrels

Test set of wikir/fr14k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.fr14k.test

Test set of wikir/fr14k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.fr14k.training.queries

Training set of wikir/fr14k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.fr14k.training.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Training set of wikir/fr14k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.fr14k.training.qrels

Training set of wikir/fr14k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.fr14k.training

Training set of wikir/fr14k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.fr14k.validation.queries

Validation set of wikir/fr14k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.fr14k.validation.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Validation set of wikir/fr14k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.fr14k.validation.qrels

Validation set of wikir/fr14k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.fr14k.validation

Validation set of wikir/fr14k. Scoreddocs are the provided BM25 run.

wikir/it16k

WikIR for Italian.

Dataset irds.wikir.it16k.documents

WikIR for Italian.

Dataset irds.wikir.it16k.test.queries

Test set of wikir/it16k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.it16k.test.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test set of wikir/it16k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.it16k.test.qrels

Test set of wikir/it16k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.it16k.test

Test set of wikir/it16k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.it16k.training.queries

Training set of wikir/it16k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.it16k.training.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Training set of wikir/it16k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.it16k.training.qrels

Training set of wikir/it16k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.it16k.training

Training set of wikir/it16k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.it16k.validation.queries

Validation set of wikir/it16k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.it16k.validation.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Validation set of wikir/it16k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.it16k.validation.qrels

Validation set of wikir/it16k. Scoreddocs are the provided BM25 run.

Dataset irds.wikir.it16k.validation

Validation set of wikir/it16k. Scoreddocs are the provided BM25 run.

TREC Fair Ranking

The TREC Fair Ranking track evaluates systems according to how well they fairly rank documents.

Website

Dataset irds.trec-fair.2021.documents

The TREC Fair Ranking track evaluates systems according to how well they fairly rank documents.

Website

Dataset irds.trec-fair.2021.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official TREC Fair Ranking 2021 train set.

Dataset irds.trec-fair.2021.train.qrels

Official TREC Fair Ranking 2021 train set.

Dataset irds.trec-fair.2021.train

Official TREC Fair Ranking 2021 train set.

Dataset irds.trec-fair.2021.eval.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official TREC Fair Ranking 2021 evaluation set.

Dataset irds.trec-fair.2021.eval.qrels

Official TREC Fair Ranking 2021 evaluation set.

Dataset irds.trec-fair.2021.eval

Official TREC Fair Ranking 2021 evaluation set.

trec-fair/2022

The TREC Fair Ranking 2022 track focuses on fairly prioritising Wikimedia articles for editing to provide a fair exposure to articles from different groups.

2022 Track Website

Dataset irds.trec-fair.2022.documents

The TREC Fair Ranking 2022 track focuses on fairly prioritising Wikimedia articles for editing to provide a fair exposure to articles from different groups.

2022 Track Website

Dataset irds.trec-fair.2022.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official TREC Fair Ranking 2022 train set.

Dataset irds.trec-fair.2022.train.qrels

Official TREC Fair Ranking 2022 train set.

Dataset irds.trec-fair.2022.train

Official TREC Fair Ranking 2022 train set.
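The dataset identifiers on this page appear to follow a simple pattern: the ir_datasets-style name used in headings and descriptions (for example trec-fair/2022/train) reappears with an `irds.` prefix and dots in place of slashes. The helper below is a small sketch of that pattern as inferred from the names listed here; `to_datamaestro_id` is a hypothetical function, not part of the library.

```python
def to_datamaestro_id(irds_id: str) -> str:
    """Map a slash-separated name to the dotted form used on this page.

    Assumption: the pattern is inferred from the listed names only.
    """
    return "irds." + irds_id.replace("/", ".")


example_id = to_datamaestro_id("trec-fair/2022/train")
```

Component suffixes such as `.queries`, `.qrels` or `.documents` are appended to the dotted form and have no slash-separated counterpart.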

trec-cast/v0

Version 0 of the TREC CAsT corpus. This version uses documents from the Washington Post (version 2), TREC CAR (version 2), and MS MARCO passage (version 1).

This corpus was originally meant to be used for evaluation of the 2019 task, but the Washington Post corpus was not included for scoring in the final version because "an error in the process led to ambiguous document ids," and Washington Post documents were removed from participating systems. As such, trec-cast/v1 (which doesn't include the Washington Post) should be used for the 2019 version of the task. However, this version can still be used for the training set (trec-cast/v0/train) or for replicating the original submissions to the track (prior to the removal of Washington Post documents).

Task Overview Paper

Dataset irds.trec-cast.v0.documents

Version 0 of the TREC CAsT corpus. This version uses documents from the Washington Post (version 2), TREC CAR (version 2), and MS MARCO passage (version 1).

This corpus was originally meant to be used for evaluation of the 2019 task, but the Washington Post corpus was not included for scoring in the final version because "an error in the process led to ambiguous document ids," and Washington Post documents were removed from participating systems. As such, trec-cast/v1 (which doesn't include the Washington Post) should be used for the 2019 version of the task. However, this version can still be used for the training set (trec-cast/v0/train) or for replicating the original submissions to the track (prior to the removal of Washington Post documents).

Task Overview Paper

Dataset irds.trec-cast.v0.train.queries

Training set provided by TREC CAsT 2019.

Dataset irds.trec-cast.v0.train.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Training set provided by TREC CAsT 2019.

Dataset irds.trec-cast.v0.train.qrels

Training set provided by TREC CAsT 2019.

Dataset irds.trec-cast.v0.train

Training set provided by TREC CAsT 2019.

Dataset irds.trec-cast.v0.train.judged.queries

trec-cast/v0/train, but with queries that do not appear in the qrels removed.

Dataset irds.trec-cast.v0.train.judged.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

trec-cast/v0/train, but with queries that do not appear in the qrels removed.

Dataset irds.trec-cast.v0.train.judged.qrels

trec-cast/v0/train, but with queries that do not appear in the qrels removed.

Dataset irds.trec-cast.v0.train.judged

trec-cast/v0/train, but with queries that do not appear in the qrels removed.
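The judged subsets keep only the queries that have at least one relevance judgment. A minimal sketch of that filtering step is shown below, assuming queries are (query ID, text) pairs and qrels are (query ID, document ID, relevance) triples. The `judged_queries` helper and the toy data are hypothetical and do not reproduce how the subsets are actually built.

```python
def judged_queries(queries, qrels):
    """Keep only the queries whose ID appears in at least one qrel."""
    judged_ids = {query_id for query_id, _doc_id, _relevance in qrels}
    return [(query_id, text) for query_id, text in queries if query_id in judged_ids]


# Toy data for illustration only.
toy_queries = [("1_1", "first turn"), ("1_2", "second turn"), ("2_1", "other topic")]
toy_qrels = [("1_1", "docA", 2), ("2_1", "docB", 0)]

kept = judged_queries(toy_queries, toy_qrels)
```

In this sketch a query with only non-relevant judgments (relevance 0) still counts as judged, because it appears in the qrels.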

trec-cast/v1

Version 1 of the TREC CAsT corpus. This version uses documents from the TREC CAR (version 2) and MS MARCO passage (version 1). This version of the corpus was used for TREC CAsT 2019 and 2020.

Task Overview Paper

Dataset irds.trec-cast.v1.documents

Version 1 of the TREC CAsT corpus. This version uses documents from the TREC CAR (version 2) and MS MARCO passage (version 1). This version of the corpus was used for TREC CAsT 2019 and 2020.

Task Overview Paper

Dataset irds.trec-cast.v1.2019.queries

Official evaluation set for TREC CAsT 2019.

Dataset irds.trec-cast.v1.2019.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official evaluation set for TREC CAsT 2019.

Dataset irds.trec-cast.v1.2019.qrels

Official evaluation set for TREC CAsT 2019.

Dataset irds.trec-cast.v1.2019

Official evaluation set for TREC CAsT 2019.

Dataset irds.trec-cast.v1.2019.judged.queries

trec-cast/v1/2019, but with queries that do not appear in the qrels removed.

Dataset irds.trec-cast.v1.2019.judged.scoreddocs

→ datamaestro_text.datasets.irds.data.AdhocAssessments

trec-cast/v1/2019, but with queries that do not appear in the qrels removed.

Dataset irds.trec-cast.v1.2019.judged.qrels

trec-cast/v1/2019, but with queries that do not appear in the qrels removed.

Dataset irds.trec-cast.v1.2019.judged

trec-cast/v1/2019, but with queries that do not appear in the qrels removed.

Dataset irds.trec-cast.v1.2020.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Official evaluation set for TREC CAsT 2020.

Task Overview Paper

Dataset irds.trec-cast.v1.2020.qrels

Official evaluation set for TREC CAsT 2020.

Task Overview Paper

Dataset irds.trec-cast.v1.2020

Official evaluation set for TREC CAsT 2020.

Task Overview Paper

Dataset irds.trec-cast.v1.2020.judged.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

trec-cast/v1/2020, but with queries that do not appear in the qrels removed.

Dataset irds.trec-cast.v1.2020.judged.qrels

trec-cast/v1/2020, but with queries that do not appear in the qrels removed.

Dataset irds.trec-cast.v1.2020.judged

trec-cast/v1/2020, but with queries that do not appear in the qrels removed.

hc4/fa

The Persian collection contains English queries and Persian documents for retrieval. Human and machine translated queries are provided in the query object for running monolingual retrieval or cross-language retrieval, assuming the machine query translation into Persian is available.

Dataset irds.hc4.fa.documents

The Persian collection contains English queries and Persian documents for retrieval. Human and machine translated queries are provided in the query object for running monolingual retrieval or cross-language retrieval, assuming the machine query translation into Persian is available.

Dataset irds.hc4.fa.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development split of hc4/fa.

Dataset irds.hc4.fa.dev.qrels

Development split of hc4/fa.

Dataset irds.hc4.fa.dev

Development split of hc4/fa.

Dataset irds.hc4.fa.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test split of hc4/fa.

Dataset irds.hc4.fa.test.qrels

Test split of hc4/fa.

Dataset irds.hc4.fa.test

Test split of hc4/fa.

Dataset irds.hc4.fa.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train split of hc4/fa.

Dataset irds.hc4.fa.train.qrels

Train split of hc4/fa.

Dataset irds.hc4.fa.train

Train split of hc4/fa.
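The HC4 query objects carry the original English query together with human and machine translations, so the same topic can drive either a cross-language or a monolingual run. The sketch below illustrates choosing one variant per run. The `ToyQuery` class, its field names and the `query_text` helper are hypothetical placeholders; the real query object may use different attribute names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyQuery:
    # Hypothetical query record; field names are assumptions for this sketch.
    query_id: str
    english: str
    human_translation: str
    machine_translation: str


def query_text(query: ToyQuery, variant: str = "english") -> str:
    """Return the query text for the requested variant.

    "english" keeps the source-language query (cross-language retrieval);
    "human" and "machine" return a translation into the document language
    (monolingual retrieval).
    """
    variants = {
        "english": query.english,
        "human": query.human_translation,
        "machine": query.machine_translation,
    }
    if variant not in variants:
        raise ValueError(f"unknown variant: {variant!r}")
    return variants[variant]


# Placeholder strings stand in for real query text.
toy = ToyQuery("q1", "english text", "human translated text", "machine translated text")
cross_language = query_text(toy)
monolingual = query_text(toy, "machine")
```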

hc4/ru

The Russian collection contains English queries and Russian documents for retrieval. Human and machine translated queries are provided in the query object for running monolingual retrieval or cross-language retrieval, assuming the machine query translation into Russian is available.

Dataset irds.hc4.ru.documents

The Russian collection contains English queries and Russian documents for retrieval. Human and machine translated queries are provided in the query object for running monolingual retrieval or cross-language retrieval, assuming the machine query translation into Russian is available.

Dataset irds.hc4.ru.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development split of hc4/ru.

Dataset irds.hc4.ru.dev.qrels

Development split of hc4/ru.

Dataset irds.hc4.ru.dev

Development split of hc4/ru.

Dataset irds.hc4.ru.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test split of hc4/ru.

Dataset irds.hc4.ru.test.qrels

Test split of hc4/ru.

Dataset irds.hc4.ru.test

Test split of hc4/ru.

Dataset irds.hc4.ru.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train split of hc4/ru.

Dataset irds.hc4.ru.train.qrels

Train split of hc4/ru.

Dataset irds.hc4.ru.train

Train split of hc4/ru.

hc4/zh

The Chinese collection contains English queries and Chinese documents for retrieval. Human and machine translated queries are provided in the query object for running monolingual retrieval or cross-language retrieval, assuming the machine query translation into Chinese is available.

Dataset irds.hc4.zh.documents

The Chinese collection contains English queries and Chinese documents for retrieval. Human and machine translated queries are provided in the query object for running monolingual retrieval or cross-language retrieval, assuming the machine query translation into Chinese is available.

Dataset irds.hc4.zh.dev.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Development split of hc4/zh.

Dataset irds.hc4.zh.dev.qrels

Development split of hc4/zh.

Dataset irds.hc4.zh.dev

Development split of hc4/zh.

Dataset irds.hc4.zh.test.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Test split of hc4/zh.

Dataset irds.hc4.zh.test.qrels

Test split of hc4/zh.

Dataset irds.hc4.zh.test

Test split of hc4/zh.

Dataset irds.hc4.zh.train.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Train split of hc4/zh.

Dataset irds.hc4.zh.train.qrels

Train split of hc4/zh.

Dataset irds.hc4.zh.train

Train split of hc4/zh.

neuclir/1/fa

The Persian collection contains English queries (to be released) and Persian documents for retrieval. Human and machine translated queries will be provided in the query object for running monolingual retrieval or cross-language retrieval, assuming the machine query translation into Persian is available.

Dataset irds.neuclir.1.fa.documents

The Persian collection contains English queries (to be released) and Persian documents for retrieval. Human and machine translated queries will be provided in the query object for running monolingual retrieval or cross-language retrieval, assuming the machine query translation into Persian is available.

Dataset irds.neuclir.1.fa.trec-2022.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Topics and assessments for the TREC NeuCLIR 2022 (Persian language CLIR).

Dataset irds.neuclir.1.fa.trec-2022.qrels

Topics and assessments for the TREC NeuCLIR 2022 (Persian language CLIR).

Dataset irds.neuclir.1.fa.trec-2022

Topics and assessments for the TREC NeuCLIR 2022 (Persian language CLIR).

Dataset irds.neuclir.1.fa.trec-2023.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Topics and assessments for the TREC NeuCLIR 2023 (Persian language CLIR).

Dataset irds.neuclir.1.fa.trec-2023.qrels

Topics and assessments for the TREC NeuCLIR 2023 (Persian language CLIR).

Dataset irds.neuclir.1.fa.trec-2023

Topics and assessments for the TREC NeuCLIR 2023 (Persian language CLIR).

neuclir/1/fa/hc4-filtered

Subset of the Persian collection that intersects with HC4. The 60 queries are the hc4/fa/dev and hc4/fa/test sets combined.

Dataset irds.neuclir.1.fa.hc4-filtered.documents

Subset of the Persian collection that intersects with HC4. The 60 queries are the hc4/fa/dev and hc4/fa/test sets combined.

Dataset irds.neuclir.1.fa.hc4-filtered.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of the Persian collection that intersects with HC4. The 60 queries are the hc4/fa/dev and hc4/fa/test sets combined.

Dataset irds.neuclir.1.fa.hc4-filtered.qrels

Subset of the Persian collection that intersects with HC4. The 60 queries are the hc4/fa/dev and hc4/fa/test sets combined.

Dataset irds.neuclir.1.fa.hc4-filtered

Subset of the Persian collection that intersects with HC4. The 60 queries are the hc4/fa/dev and hc4/fa/test sets combined.
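The hc4-filtered query set is described as the dev and test query sets combined. As a rough sketch of that kind of merge, the snippet below unions several query splits and refuses duplicate query IDs. The `combine_splits` helper and the toy queries are hypothetical; the actual subset is provided ready-made and does not need to be rebuilt this way.

```python
def combine_splits(*splits):
    """Merge several lists of (query_id, text) pairs into one mapping.

    Raises ValueError if the same query ID occurs in more than one place,
    since the splits are expected to be disjoint.
    """
    combined = {}
    for split in splits:
        for query_id, text in split:
            if query_id in combined:
                raise ValueError(f"duplicate query id: {query_id}")
            combined[query_id] = text
    return combined


# Toy data for illustration only.
toy_dev = [("101", "dev query a"), ("102", "dev query b")]
toy_test = [("201", "test query a")]

all_queries = combine_splits(toy_dev, toy_test)
```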

neuclir/1/multi

A combined corpus of NeuCLIR v1 including all Persian, Russian, and Chinese documents.

Dataset irds.neuclir.1.multi.documents

A combined corpus of NeuCLIR v1 including all Persian, Russian, and Chinese documents.

Dataset irds.neuclir.1.multi.trec-2023.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Topics and assessments for the TREC NeuCLIR 2023 multi-language retrieval task.

Dataset irds.neuclir.1.multi.trec-2023.qrels

Topics and assessments for the TREC NeuCLIR 2023 multi-language retrieval task.

Dataset irds.neuclir.1.multi.trec-2023

Topics and assessments for the TREC NeuCLIR 2023 multi-language retrieval task.

neuclir/1/ru

The Russian collection contains English queries (to be released) and Russian documents for retrieval. Human and machine translated queries will be provided in the query object for running monolingual retrieval or cross-language retrieval, assuming the machine query translation into Russian is available.

Dataset irds.neuclir.1.ru.documents

The Russian collection contains English queries (to be released) and Russian documents for retrieval. Human and machine translated queries will be provided in the query object for running monolingual retrieval or cross-language retrieval, assuming the machine query translation into Russian is available.

Dataset irds.neuclir.1.ru.trec-2022.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Topics and assessments for the TREC NeuCLIR 2022 (Russian language CLIR).

Dataset irds.neuclir.1.ru.trec-2022.qrels

Topics and assessments for the TREC NeuCLIR 2022 (Russian language CLIR).

Dataset irds.neuclir.1.ru.trec-2022

Topics and assessments for the TREC NeuCLIR 2022 (Russian language CLIR).

Dataset irds.neuclir.1.ru.trec-2023.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Topics and assessments for the TREC NeuCLIR 2023 (Russian language CLIR).

Dataset irds.neuclir.1.ru.trec-2023.qrels

Topics and assessments for the TREC NeuCLIR 2023 (Russian language CLIR).

Dataset irds.neuclir.1.ru.trec-2023

Topics and assessments for the TREC NeuCLIR 2023 (Russian language CLIR).

neuclir/1/ru/hc4-filtered

Subset of the Russian collection that intersects with HC4. The 54 queries are the hc4/ru/dev and hc4/ru/test sets combined.

Dataset irds.neuclir.1.ru.hc4-filtered.documents

Subset of the Russian collection that intersects with HC4. The 54 queries are the hc4/ru/dev and hc4/ru/test sets combined.

Dataset irds.neuclir.1.ru.hc4-filtered.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of the Russian collection that intersects with HC4. The 54 queries are the hc4/ru/dev and hc4/ru/test sets combined.

Dataset irds.neuclir.1.ru.hc4-filtered.qrels

Subset of the Russian collection that intersects with HC4. The 54 queries are the hc4/ru/dev and hc4/ru/test sets combined.

Dataset irds.neuclir.1.ru.hc4-filtered

Subset of the Russian collection that intersects with HC4. The 54 queries are the hc4/ru/dev and hc4/ru/test sets combined.

neuclir/1/zh

The Chinese collection contains English queries (to be released) and Chinese documents for retrieval. Human and machine translated queries will be provided in the query object for running monolingual retrieval or cross-language retrieval, assuming the machine query translation into Chinese is available.

Dataset irds.neuclir.1.zh.documents

The Chinese collection contains English queries (to be released) and Chinese documents for retrieval. Human and machine translated queries will be provided in the query object for running monolingual retrieval or cross-language retrieval, assuming the machine query translation into Chinese is available.

Dataset irds.neuclir.1.zh.trec-2022.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Topics and assessments for the TREC NeuCLIR 2022 (Chinese language CLIR).

Dataset irds.neuclir.1.zh.trec-2022.qrels

Topics and assessments for the TREC NeuCLIR 2022 (Chinese language CLIR).

Dataset irds.neuclir.1.zh.trec-2022

Topics and assessments for the TREC NeuCLIR 2022 (Chinese language CLIR).

Dataset irds.neuclir.1.zh.trec-2023.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Topics and assessments for the TREC NeuCLIR 2023 (Chinese language CLIR).

Dataset irds.neuclir.1.zh.trec-2023.qrels

Topics and assessments for the TREC NeuCLIR 2023 (Chinese language CLIR).

Dataset irds.neuclir.1.zh.trec-2023

Topics and assessments for the TREC NeuCLIR 2023 (Chinese language CLIR).

neuclir/1/zh/hc4-filtered

Subset of the Chinese collection that intersects with HC4. The 60 queries are the hc4/zh/dev and hc4/zh/test sets combined.

Dataset irds.neuclir.1.zh.hc4-filtered.documents

Subset of the Chinese collection that intersects with HC4. The 60 queries are the hc4/zh/dev and hc4/zh/test sets combined.

Dataset irds.neuclir.1.zh.hc4-filtered.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

Subset of the Chinese collection that intersects with HC4. The 60 queries are the hc4/zh/dev and hc4/zh/test sets combined.

Dataset irds.neuclir.1.zh.hc4-filtered.qrels

Subset of the Chinese collection that intersects with HC4. The 60 queries are the hc4/zh/dev and hc4/zh/test sets combined.

Dataset irds.neuclir.1.zh.hc4-filtered

Subset of the Chinese collection that intersects with HC4. The 60 queries are the hc4/zh/dev and hc4/zh/test sets combined.

SARA

A set of sensitivity-aware relevance assessments. More information is available here:

SARA

Dataset irds.sara.documents

A set of sensitivity-aware relevance assessments. More information is available here:

SARA

Dataset irds.sara.queries

→ datamaestro_text.datasets.irds.data.AdhocAssessments

A set of sensitivity-aware relevance assessments. More information is available here:

SARA

Dataset irds.sara.qrels

A set of sensitivity-aware relevance assessments. More information is available here:

SARA

Dataset irds.sara