Conversational IR Datasets

This section lists datasets for conversational information retrieval and contextual query understanding tasks.

Contextual Query Rewriting

These datasets contain conversational queries that need to be rewritten to be self-contained (decontextualization), resolving coreferences and ellipses from the conversation context.
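
To make the task concrete, the following minimal sketch shows what a single rewriting instance might look like. The `RewriteExample` class, its field names, the `as_model_input` helper and the separator are hypothetical and chosen for illustration only; they are not the classes exposed by the datasets below, and the sample conversation is made up rather than taken from any dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RewriteExample:
    """One query-rewriting instance (hypothetical structure, for illustration)."""

    history: List[str]  # previous utterances, oldest first
    question: str       # context-dependent question, as asked
    rewrite: str        # self-contained version of the question


def as_model_input(example: RewriteExample, separator: str = " ||| ") -> str:
    """Flatten the history and the current question into a single string.

    This is one common way to feed a rewriting model; the separator is arbitrary.
    """
    return separator.join([*example.history, example.question])


# Made-up conversation: "it" must be resolved to "Frankenstein".
example = RewriteExample(
    history=[
        "Who wrote Frankenstein?",
        "Mary Shelley wrote Frankenstein.",
    ],
    question="When was it published?",
    rewrite="When was Frankenstein published?",
)

print(as_model_input(example))
print(example.rewrite)
```

The rewrite replaces the pronoun with its antecedent, so the question can be sent to a retrieval system without the conversation history.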

CANARD

Context-dependent Query Rewriting dataset for conversational question answering. Contains queries from QuAC that have been manually rewritten to be self-contained.

Dataset com.github.aagohary.canard

→ datamaestro.data.ml.Supervised

Question-in-context rewriting

Tags: conversation, query, context

Tasks: query rewriting

External link: https://sites.google.com/view/qanta/projects/canard

CANARD is a dataset for question-in-context rewriting that consists of questions each given in a dialog context together with a context-independent rewriting of the question. The context of each question is the dialog utterances that precede the question. CANARD can be used to evaluate question rewriting models that handle important linguistic phenomena such as co-reference and ellipsis resolution.

Each dataset is an instance of datamaestro_text.data.conversation.CanardDataset.

Example:

from datamaestro import prepare_dataset

canard = prepare_dataset("com.github.aagohary.canard.train")
for entry in canard.iter():
    print(f"Original: {entry.source}")
    print(f"Rewritten: {entry.rewrite}")
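
Since CANARD pairs each question with a reference rewrite, a rewriting model can be scored by comparing its output with that reference. The sketch below uses a token-level F1 as a simple stand-in; it is an assumption made for illustration and not necessarily the measure used in the CANARD paper. The `tokenize` and `token_f1` helpers are hypothetical and not part of datamaestro.

```python
from collections import Counter
from typing import List

_PUNCT = "?.,!\"'"


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    stripped = (tok.strip(_PUNCT) for tok in text.lower().split())
    return [tok for tok in stripped if tok]


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (multiset overlap)."""
    pred = Counter(tokenize(prediction))
    ref = Counter(tokenize(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "When was Frankenstein published?"
print(token_f1("When was Frankenstein published?", reference))
print(token_f1("When was it published?", reference))
```

An unresolved pronoun lowers the score because the antecedent token is missing from the prediction.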

OrConvQA

Open-Retrieval Conversational Question Answering dataset. Contains multi-turn QA conversations with passage retrieval.

Dataset com.github.prdwb.orconvqa.preprocessed

→ datamaestro.data.ml.Supervised

Open-Retrieval Conversational Question Answering datasets

Tags: conversation, query, context

Tasks: query rewriting

External link: https://github.com/prdwb/orconvqa-release

OrConvQA is an aggregation of three existing datasets:

the QuAC dataset that offers information-seeking conversations,
the CANARD dataset that consists of context-independent rewrites of QuAC questions, and
the Wikipedia corpus that serves as the knowledge source for answering questions.

Each dataset is an instance of datamaestro_text.data.conversation.OrConvQADataset.
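
The aggregation described above can be pictured as a join between conversation turns and a passage collection. The sketch below is a rough illustration under stated assumptions: the `Turn` class, its field names, the passage identifiers and the dictionary-based corpus are all hypothetical and do not reflect the actual OrConvQADataset or document store interfaces.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Turn:
    """One conversation turn combining the three sources (hypothetical layout)."""

    question: str           # question as asked in the conversation (QuAC side)
    rewrite: str            # context-independent rewrite (CANARD side)
    passage_ids: List[str]  # identifiers of relevant passages (Wikipedia side)


def gold_passages(turn: Turn, corpus: Dict[str, str]) -> List[str]:
    """Look up the text of each referenced passage, skipping unknown identifiers."""
    return [corpus[pid] for pid in turn.passage_ids if pid in corpus]


# Tiny in-memory stand-in for the passage collection (made-up identifiers).
corpus = {
    "p1": "Frankenstein was first published in 1818.",
    "p2": "Dracula was first published in 1897.",
}

turn = Turn(
    question="When was it published?",
    rewrite="When was Frankenstein published?",
    passage_ids=["p1", "missing"],
)

print(gold_passages(turn, corpus))
```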

Dataset com.github.prdwb.orconvqa.passages

→ datamaestro_text.data.ir.stores.OrConvQADocumentStore

OrConvQA Wikipedia files

External link: https://github.com/prdwb/orconvqa-release

OrConvQA is an aggregation of three existing datasets:

the QuAC dataset that offers information-seeking conversations,
the CANARD dataset that consists of context-independent rewrites of QuAC questions, and
the Wikipedia corpus that serves as the knowledge source for answering questions.

QReCC

Question Rewriting in Conversational Context dataset. Contains conversations with human rewrites of questions.

Dataset com.github.apple.ml-qrecc

→ datamaestro.data.ml.Supervised

Open-Domain Question Answering Goes Conversational via Question Rewriting

Tags: conversation, query, context

Tasks: query rewriting

External link: https://github.com/apple/ml-qrecc

We introduce QReCC (Question Rewriting in Conversational Context), an end-to-end open-domain question answering dataset comprising 14K conversations with 81K question-answer pairs. The goal of this dataset is to provide a challenging benchmark for end-to-end conversational question answering that includes the individual subtasks of question rewriting, passage retrieval and reading comprehension.
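
The three subtasks named above chain naturally into a pipeline: rewrite the question, retrieve passages for the rewrite, then read an answer from them. The sketch below only illustrates that chaining. Every name in it (`answer_turn`, the type aliases, the toy components and the two-passage corpus) is hypothetical, and the toy components are crude placeholders where a real system would use trained models.

```python
from typing import Callable, List, Sequence, Set

# Hypothetical signatures for the three stages.
Rewriter = Callable[[Sequence[str], str], str]
Retriever = Callable[[str], List[str]]
Reader = Callable[[str, List[str]], str]


def answer_turn(
    history: Sequence[str],
    question: str,
    rewriter: Rewriter,
    retriever: Retriever,
    reader: Reader,
) -> str:
    """Run question rewriting, passage retrieval and reading in sequence."""
    standalone = rewriter(history, question)
    passages = retriever(standalone)
    return reader(standalone, passages)


CORPUS = [
    "Frankenstein was first published in 1818.",
    "Dracula was first published in 1897.",
]


def _terms(text: str) -> Set[str]:
    """Lowercased tokens with surrounding punctuation removed."""
    return {tok.strip("?.,").lower() for tok in text.split()}


def toy_rewriter(history: Sequence[str], question: str) -> str:
    # Placeholder: substitutes a hard-coded referent for the pronoun "it".
    return question.replace(" it ", " Frankenstein ")


def toy_retriever(query: str) -> List[str]:
    # Placeholder: return the single passage sharing the most terms with the query.
    return [max(CORPUS, key=lambda passage: len(_terms(query) & _terms(passage)))]


def toy_reader(query: str, passages: List[str]) -> str:
    # Placeholder: return the top passage verbatim instead of extracting a span.
    return passages[0] if passages else ""


history = ["Who wrote Frankenstein?", "Mary Shelley wrote Frankenstein."]
print(answer_turn(history, "When was it published?", toy_rewriter, toy_retriever, toy_reader))
```

Keeping the stages behind plain callables makes it possible to evaluate each subtask separately or to swap one component without touching the others.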

Dataset com.github.apple.ml-qrecc.content

→ datamaestro_text.datasets.irds.data.LZ4JSONLDocumentStore

Content of the URLs mentioned in QReCC

External link: https://github.com/apple/ml-qrecc