Information Retrieval API

This module provides data types for Information Retrieval datasets and experiments.

The core abstractions are:

Documents - Collections of documents to be searched
Topics - Queries or information needs
Assessments - Relevance judgments (qrels) linking topics to relevant documents
Adhoc - A complete IR test collection combining documents, topics, and assessments

For training neural rankers:

TrainingTriplets - Training data as (query, positive_doc, negative_doc) triplets
PairwiseSampleDataset - General pairwise training data

Data objects

class datamaestro_text.data.ir.base.IDItem(id: str)

Bases: Item, ABC

A topic/document with an external ID

class datamaestro_text.data.ir.base.InternalIDItem(id: int)

Bases: Item, ABC

A topic/document with an internal ID

class datamaestro_text.data.ir.base.ScoredItem(score: float)

Bases: Item

A score associated with the document

score: float: A retrieval score associated with this record (e.g. of the first-stage retriever)

class datamaestro_text.data.ir.base.SimpleTextItem(text: str)

Bases: TextItem

A topic/document with a text record

class datamaestro_text.data.ir.base.TextItem

Bases: Item, ABC

abstract property text: str: Returns the text

class datamaestro_text.data.ir.base.UrlItem(url: str)

Bases: Item

A URL item

datamaestro_text.data.ir.base.create_record(*items: Item, id: str = None, text: str = None) → Record: Convenience function that creates a record from the given items, optionally adding an ID item and a text item

Collection

XPM Config datamaestro_text.data.ir.Adhoc(*, id, documents, topics, assessments)

Bases: Base

An Adhoc IR collection with documents, topics and their assessments

id: str: The unique (sub-)dataset ID

documents: datamaestro_text.data.ir.Documents: The set of documents

topics: datamaestro_text.data.ir.Topics: The set of topics

assessments: datamaestro_text.data.ir.AdhocAssessments: The set of assessments (for each topic)

XPM Config datamaestro_text.datasets.irds.data.Adhoc(*, irds, id, documents, topics, assessments)

Bases: Adhoc, IRDSId

irds: str: The id to load the dataset from ir_datasets

id: str: The unique (sub-)dataset ID

documents: datamaestro_text.data.ir.Documents: The set of documents

topics: datamaestro_text.data.ir.Topics: The set of topics

assessments: datamaestro_text.data.ir.AdhocAssessments: The set of assessments (for each topic)

Topics

XPM Config datamaestro_text.data.ir.Topics(*, id)

Bases: Base, ABC

A set of topics with associated IDs

id: str: The unique (sub-)dataset ID

count() → int | None: Returns the number of topics if known

abstract iter() → Iterator[Record]: Returns an iterator over topics

XPM Config datamaestro_text.data.ir.csv.Topics(*, id, path, separator)

Bases: Topics

One query per line, format query_id<SEP>query

id: str: The unique (sub-)dataset ID

path: path

separator: str
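
As a rough illustration of the `query_id<SEP>query` layout described above, the sketch below parses such lines with the standard library only. The helper name `parse_topic_lines` and the tab default are assumptions made for this example; they are not part of datamaestro_text.

```python
from typing import Iterable, Iterator, Tuple


def parse_topic_lines(lines: Iterable[str], separator: str = "\t") -> Iterator[Tuple[str, str]]:
    """Yield (query_id, query_text) pairs from `id<SEP>text` lines.

    Hypothetical helper: it mirrors the file layout, not the library code.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first separator only, so the query text may contain it
        query_id, query_text = line.split(separator, 1)
        yield query_id, query_text


topics = list(parse_topic_lines(["q1\tneural ranking models\n", "q2\tdense retrieval\n"]))
```

Splitting on the first separator only is a design choice of this sketch: it keeps queries intact when they happen to contain the separator character.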

XPM Config datamaestro_text.data.ir.TopicsStore(*, id)

Bases: Topics

Adhoc topics store

id: str: The unique (sub-)dataset ID

XPM Config datamaestro_text.transforms.ir.TopicWrapper

Bases: Config, ABC

Modify topics on the fly using a topic wrapper

Dataset-specific Topics

XPM Config datamaestro_text.data.ir.trec.TrecTopics(*, id, path, parts)

Bases: Topics

id: str: The unique (sub-)dataset ID

path: path

parts: List[str]

XPM Config datamaestro_text.data.ir.cord19.Topics(*, id, path)

Bases: Topics, File

XML format used in Adhoc topics

id: str: The unique (sub-)dataset ID

path: path: The path of the file

XPM Config datamaestro_text.datasets.irds.data.Topics(*, irds, id)

Bases: TopicsStore, IRDSId

irds: str: The id to load the dataset from ir_datasets

id: str: The unique (sub-)dataset ID

Documents

XPM Config datamaestro_text.data.ir.Documents(*, id, count)

Bases: Base

A set of documents with identifiers

See IR Datasets for the list of document classes

id: str: The unique (sub-)dataset ID

count: int: Number of documents

property documentcount: Returns the number of terms in the index

iter_ids() → Iterator[str]

Iterates over document ids

By default, this falls back on iter_documents, which is inefficient because every document is loaded just to read its ID.
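
To make the cost of that default concrete, here is a minimal sketch built on stand-in classes. `SketchDocuments`, `SketchDocumentsWithIdList`, and the `loaded` counter are hypothetical names used only for illustration; they are not the datamaestro_text classes.

```python
from typing import Dict, Iterator, List, Tuple


class SketchDocuments:
    """Stand-in for a document collection (not the datamaestro_text class)."""

    def __init__(self, docs: Dict[str, str]):
        self._docs = docs
        self.loaded = 0  # counts how many document bodies were materialized

    def iter_documents(self) -> Iterator[Tuple[str, str]]:
        for doc_id, text in self._docs.items():
            self.loaded += 1
            yield doc_id, text

    def iter_ids(self) -> Iterator[str]:
        # Default: walk over full documents and keep only the ID (wasteful)
        for doc_id, _text in self.iter_documents():
            yield doc_id


class SketchDocumentsWithIdList(SketchDocuments):
    """Variant that overrides iter_ids to avoid touching document bodies."""

    def iter_ids(self) -> Iterator[str]:
        yield from self._docs.keys()


docs = {"d1": "first text", "d2": "second text"}
slow, fast = SketchDocuments(docs), SketchDocumentsWithIdList(docs)
slow_ids: List[str] = list(slow.iter_ids())
fast_ids: List[str] = list(fast.iter_ids())
```

Both variants yield the same IDs, but only the default one materializes every document, which is why a subclass with cheap access to its IDs would likely want to override `iter_ids`.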

XPM Config datamaestro_text.data.ir.csv.Documents(*, id, count, path, separator)

Bases: Documents

One line per document, format pid<SEP>text

id: str: The unique (sub-)dataset ID

count: int: Number of documents

path: path

separator: str

XPM Config datamaestro_text.datasets.irds.data.LZ4DocumentStore(*, id, count, file_access, path, lookup_field)

Bases: DocumentStore, ABC

A LZ4-based document store

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path

lookup_field: str

XPM Config datamaestro_text.datasets.irds.data.LZ4JSONLDocumentStore(*, id, count, file_access, path, lookup_field)

Bases: LZ4DocumentStore

JSON Lines (JSONL) based document store

Each line is of the form `{ "id": "...", "text": "..." }`

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path

lookup_field: str
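
The line format and the role of `lookup_field` can be sketched as follows. This is an in-memory illustration only: it skips LZ4 compression and on-disk access entirely, and `build_lookup` is a hypothetical helper, not a function of the library.

```python
import json
from typing import Dict, Iterable


def build_lookup(lines: Iterable[str], lookup_field: str = "id") -> Dict[str, dict]:
    """Index JSON Lines documents by the value of `lookup_field`.

    Hypothetical in-memory helper; the real store is LZ4-based.
    """
    index: Dict[str, dict] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        doc = json.loads(line)
        index[str(doc[lookup_field])] = doc
    return index


jsonl = [
    '{"id": "d1", "text": "first document"}',
    '{"id": "d2", "text": "second document"}',
]
lookup = build_lookup(jsonl)
```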

IR-Datasets Base

XPM Config datamaestro_text.datasets.irds.data.IRDSId(*, irds)

Bases: Config

irds: str: The id to load the dataset from ir_datasets

Dataset-specific documents

XPM Config datamaestro_text.data.ir.cord19.Documents(*, id, path, delimiter, ignore, names_row, count)

Bases: Documents, Generic

id: str: The unique (sub-)dataset ID

path: path: The path of the file

delimiter: str = ,

ignore: int = 0

names_row: int = -1

count: int: Number of documents

XPM Config datamaestro_text.data.ir.trec.TipsterCollection(*, id, count, path)

Bases: Documents

id: str: The unique (sub-)dataset ID

count: int: Number of documents

path: path

XPM Config datamaestro_text.data.ir.stores.OrConvQADocumentStore(*, id, count, file_access, path)

Bases: LZ4DocumentStore

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path

lookup_field: str = id (constant)

index_fields: List[str] = ['id'] (constant)

fields: List[str] = ['id', 'title', 'body', 'aid', 'bid'] (constant)

XPM Config datamaestro_text.data.ir.stores.IKatClueWeb22DocumentStore(*, id, count, file_access, path)

Bases: LZ4DocumentStore

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

path: path

lookup_field: str = id (constant)

index_fields: List[str] = ['id'] (constant)

XPM Config datamaestro_text.datasets.irds.data.Documents(*, irds, id, count, file_access)

Bases: DocumentStore, IRDSId

irds: str: The id to load the dataset from ir_datasets

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

Assessments

XPM Config datamaestro_text.data.ir.AdhocAssessments(*, id)

Bases: Base, ABC

Ad-hoc assessments (qrels)

id: str: The unique (sub-)dataset ID

iter() → Iterator[AdhocAssessedTopic]: Returns an iterator over assessments

XPM Config datamaestro_text.data.ir.trec.TrecAdhocAssessments(*, id, path)

Bases: AdhocAssessments

id: str: The unique (sub-)dataset ID

path: path

XPM Config datamaestro_text.datasets.irds.data.AdhocAssessments(*, irds, id)

Bases: AdhocAssessments, IRDSId

irds: str: The id to load the dataset from ir_datasets

id: str: The unique (sub-)dataset ID

class datamaestro_text.data.ir.AdhocAssessedTopic(topic_id: str, assessments: List[AdhocAssessment]): Bases: object

class datamaestro_text.data.ir.AdhocAssessment(doc_id: str): Bases: object
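
The two classes above suggest that assessments are grouped per topic. The sketch below shows one way such a grouping could be built from TREC-style qrels lines, which have four whitespace-separated columns (`qid iteration docid rel`). `SketchAssessment`, `SketchAssessedTopic`, and `group_qrels` are hypothetical stand-ins, and the `rel` field is an assumption, since the documented signature only lists `doc_id`.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class SketchAssessment:
    doc_id: str
    rel: int  # assumed field; the documented signature only lists doc_id


@dataclass
class SketchAssessedTopic:
    topic_id: str
    assessments: List[SketchAssessment] = field(default_factory=list)


def group_qrels(lines: Iterable[str]) -> List[SketchAssessedTopic]:
    """Group TREC qrels lines (`qid iteration docid rel`) by topic."""
    by_topic: Dict[str, SketchAssessedTopic] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # ignore blank or malformed lines
        qid, _iteration, doc_id, rel = parts
        topic = by_topic.setdefault(qid, SketchAssessedTopic(qid))
        topic.assessments.append(SketchAssessment(doc_id, int(rel)))
    return list(by_topic.values())


assessed = group_qrels(["q1 0 d1 1", "q1 0 d7 0", "q2 0 d3 2"])
```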

Runs

XPM Config datamaestro_text.data.ir.AdhocRun(*, id)

Bases: Base

IR adhoc run

id: str: The unique (sub-)dataset ID

XPM Config datamaestro_text.data.ir.csv.AdhocRunWithText(*, id, path, separator)

Bases: AdhocRun

One entry per line, with fields (qid, doc.id, query, passage)

id: str: The unique (sub-)dataset ID

path: path

separator: str

XPM Config datamaestro_text.data.ir.trec.TrecAdhocRun(*, id, path)

Bases: AdhocRun

id: str: The unique (sub-)dataset ID

path: path

XPM Config datamaestro_text.datasets.irds.data.AdhocRun(*, irds, id)

Bases: AdhocRun, IRDSId

irds: str: The id to load the dataset from ir_datasets

id: str: The unique (sub-)dataset ID

Results

XPM Config datamaestro_text.data.ir.AdhocResults(*, id)

Bases: Base

id: str: The unique (sub-)dataset ID

XPM Config datamaestro_text.data.ir.trec.TrecAdhocResults(*, id, metrics, results, detailed)

Bases: AdhocResults

Adhoc results (TREC format)

id: str: The unique (sub-)dataset ID

metrics: List[datamaestro_text.data.ir.Measure]: List of reported metrics

results: path: Main results

detailed: path: Results per topic (if any)

get_results() → Dict[str, float]: Returns the results as a dictionary {metric_name: value}
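
To illustrate the `{metric_name: value}` shape, the sketch below parses trec_eval-style output, where each row has three whitespace-separated columns (measure, topic, value) and aggregate rows use `all` as the topic. Whether `TrecAdhocResults` reads exactly this layout is an assumption, and `parse_trec_eval` is a hypothetical helper.

```python
from typing import Dict, Iterable


def parse_trec_eval(lines: Iterable[str]) -> Dict[str, float]:
    """Collect aggregate (`all`) rows of trec_eval-style output."""
    results: Dict[str, float] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or parts[1] != "all":
            continue  # per-topic rows and malformed lines are ignored
        name, _topic, value = parts
        try:
            results[name] = float(value)
        except ValueError:
            pass  # non-numeric rows such as `runid` are skipped
    return results


summary = parse_trec_eval(
    ["runid\tall\tbm25", "map\tall\t0.2514", "P_10\tall\t0.4100", "map\tq1\t0.5"]
)
```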

Evaluation

XPM Config datamaestro_text.data.ir.Measure

Bases: Config

An Information Retrieval measure

Reranking

XPM Config datamaestro_text.data.ir.RerankAdhoc(*, id, documents, topics, assessments, run)

Bases: Adhoc

Re-ranking ad-hoc task based on an existing run

id: str: The unique (sub-)dataset ID

documents: datamaestro_text.data.ir.Documents: The set of documents

topics: datamaestro_text.data.ir.Topics: The set of topics

assessments: datamaestro_text.data.ir.AdhocAssessments: The set of assessments (for each topic)

run: datamaestro_text.data.ir.AdhocRun: The run to re-rank
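
The re-ranking setting can be sketched with plain dictionaries: an existing run supplies the candidates for each topic, and a scoring function reorders them. Everything here (`rerank`, `overlap`, and the dictionary-based run, topics, and documents) is a hypothetical simplification, not the library's data model.

```python
from typing import Callable, Dict, List, Tuple


def rerank(
    run: Dict[str, List[str]],
    topics: Dict[str, str],
    documents: Dict[str, str],
    score: Callable[[str, str], float],
) -> Dict[str, List[Tuple[str, float]]]:
    """Re-score the candidates of an existing run and sort them per topic."""
    reranked: Dict[str, List[Tuple[str, float]]] = {}
    for topic_id, candidates in run.items():
        query = topics[topic_id]
        scored = [(doc_id, score(query, documents[doc_id])) for doc_id in candidates]
        # Highest score first; ties keep the first-stage order (stable sort)
        reranked[topic_id] = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return reranked


def overlap(query: str, text: str) -> float:
    """Toy scorer: number of distinct query terms that occur in the document."""
    return float(len(set(query.split()) & set(text.split())))


topics = {"q1": "dense retrieval"}
documents = {"d1": "sparse models", "d2": "dense retrieval models", "d3": "dense vectors"}
run = {"q1": ["d1", "d3", "d2"]}
reranked = rerank(run, topics, documents, overlap)
```

Only documents already present in the run are scored, which is what distinguishes re-ranking from full retrieval over the collection.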

Document Index

XPM Config datamaestro_text.data.ir.DocumentStore(*, id, count, file_access)

Bases: Documents

A document store

A document store can:

- match external and internal IDs
- return the document content
- return the number of documents

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

docid_internal2external(docid: int): Converts an internal collection ID (integer) to an external ID

document_ext(docid: str) → Record: Returns a document given its external ID

document_int(internal_docid: int) → Record: Returns a document given its internal ID

property documentcount: Returns the number of terms in the index

iter_sample(randint: Callable[[int], int] | None) → Iterator[Record]: Sample documents from the dataset
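
The relationship between internal and external IDs can be sketched with an in-memory stand-in. `SketchDocumentStore` is hypothetical; records are plain dictionaries rather than `Record` objects, and two behaviors are assumptions of this sketch: `randint(n)` returns an integer in `[0, n)`, and sampling is endless and with replacement.

```python
import random
from typing import Callable, Dict, Iterator, List, Optional


class SketchDocumentStore:
    """In-memory stand-in for a document store (hypothetical, for illustration)."""

    def __init__(self, docs: Dict[str, str]):
        self._ext_ids: List[str] = list(docs)  # internal ID = position in this list
        self._texts: List[str] = [docs[ext_id] for ext_id in self._ext_ids]
        self._ext2int = {ext_id: ix for ix, ext_id in enumerate(self._ext_ids)}

    @property
    def documentcount(self) -> int:
        return len(self._ext_ids)

    def docid_internal2external(self, docid: int) -> str:
        return self._ext_ids[docid]

    def document_int(self, internal_docid: int) -> Dict[str, str]:
        return {"id": self._ext_ids[internal_docid], "text": self._texts[internal_docid]}

    def document_ext(self, docid: str) -> Dict[str, str]:
        return self.document_int(self._ext2int[docid])

    def iter_sample(
        self, randint: Optional[Callable[[int], int]] = None
    ) -> Iterator[Dict[str, str]]:
        # Assumption: randint(n) returns an integer in [0, n)
        randint = randint or random.Random(0).randrange
        while True:
            yield self.document_int(randint(self.documentcount))


store = SketchDocumentStore({"doc-a": "alpha", "doc-b": "beta"})
samples = store.iter_sample(lambda n: n - 1)  # always picks the last document
first_sample = next(samples)
```

Passing the random function as a parameter lets the caller control the seed, which keeps sampling reproducible across runs.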

XPM Config datamaestro_text.data.ir.AdhocIndex(*, id, count, file_access)

Bases: DocumentStore

An index can be used to retrieve documents based on terms

id: str: The unique (sub-)dataset ID

count: int: Number of documents

file_access: FileAccess = FileAccess.MMAP: How to access the file collection (might not have any impact, depends on the docstore)

term_df(term: str): Returns the document frequency

property termcount: Returns the number of terms in the index
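
Document frequency is the number of documents that contain a term, regardless of how often the term occurs in each. The sketch below computes it with a toy whitespace tokenizer. `SketchIndex` is a hypothetical stand-in, and reading `termcount` as the number of distinct terms is an assumption.

```python
from collections import Counter
from typing import Dict


class SketchIndex:
    """Tiny stand-in index using whitespace tokenization (hypothetical)."""

    def __init__(self, docs: Dict[str, str]):
        self._df: Counter = Counter()
        for text in docs.values():
            # set(): a term counts once per document, whatever its frequency
            self._df.update(set(text.lower().split()))

    def term_df(self, term: str) -> int:
        return self._df[term]  # Counter returns 0 for unseen terms

    @property
    def termcount(self) -> int:
        # Assumed meaning: number of distinct terms
        return len(self._df)


index = SketchIndex({"d1": "dense dense retrieval", "d2": "sparse retrieval"})
```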

Training triplets

XPM Config datamaestro_text.data.ir.TrainingTriplets(*, id)

Bases: Base, ABC

Triplets for training IR systems: (query or query ID, positive document, negative document)

id: str: The unique (sub-)dataset ID

iter() → Iterator[Tuple[Record, Record, Record]]: Returns an iterator over (topic, positive document, negative document) triplets

XPM Config datamaestro_text.data.ir.PairwiseSampleDataset(*, id)

Bases: Base, ABC

Datasets where each record is a query with positive and negative samples

id: str: The unique (sub-)dataset ID

XPM Config datamaestro_text.data.ir.TrainingTripletsLines(*, id, sep, path, doc_ids, topic_ids)

Bases: TrainingTriplets

Training triplets with one line per triplet (query texts)

id: str: The unique (sub-)dataset ID

sep: str

path: path

doc_ids: bool: True if we have documents IDs

topic_ids: bool: True if we have query IDs
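
A one-triplet-per-line file can be read as sketched below. The field order (query, positive, negative), the tab default, and the helper name `iter_triplet_lines` are assumptions made for this example. Depending on `doc_ids` and `topic_ids`, each field would hold either an ID or raw text.

```python
from typing import Iterable, Iterator, Tuple

Triplet = Tuple[str, str, str]


def iter_triplet_lines(lines: Iterable[str], sep: str = "\t") -> Iterator[Triplet]:
    """Yield (query, positive, negative) from lines with three `sep`-separated fields."""
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(sep)
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(fields)}")
        query, positive, negative = fields
        yield query, positive, negative


triplets = list(
    iter_triplet_lines(["what is bm25\tBM25 is a ranking function\tCats are mammals\n"])
)
```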

XPM Config datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset(*, id, repo_id, name, data_files, split, streaming, local_path, ids, query_id, pos_id, neg_id)

Bases: HuggingFaceDataset, PairwiseSampleDataset

Triplet for training IR systems: query / query ID, positive document, negative document

id: str: The unique (sub-)dataset ID

repo_id: str: The HuggingFace repository id (e.g. user/dataset).

name: str: HuggingFace dataset name (a.k.a. config).

data_files: str: Specific data files to load.

split: str: Dataset split to load.

streaming: bool = False: When True, load the dataset in streaming mode (no local cache).

local_path: path: If set, load from this local mirror instead of the HuggingFace Hub. This is a meta parameter because the logical dataset is the same regardless of where the bytes come from.

ids: bool: True if the triplet is made of IDs, False otherwise

query_id: str = qid: The name of the field containing the query ID

pos_id: str = pos: The name of the field containing the positive samples

neg_id: str = neg: The name of the field containing the negative samples
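
The field-name parameters above can be read as pointing into records that hold one query together with several positive and negative samples. The sketch below expands such records into triplets using plain dictionaries. It assumes that the positive and negative fields contain lists of document IDs, which the reference does not state, and `expand_to_triplets` is a hypothetical helper that does not use the HuggingFace `datasets` library.

```python
from itertools import product
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def expand_to_triplets(
    records: Iterable[Dict[str, Any]],
    query_id: str = "qid",
    pos_id: str = "pos",
    neg_id: str = "neg",
) -> Iterator[Tuple[str, str, str]]:
    """Yield one (query, positive, negative) triplet per positive/negative pair."""
    for record in records:
        # Assumption: the pos/neg fields are lists of document IDs
        for positive, negative in product(record[pos_id], record[neg_id]):
            yield str(record[query_id]), positive, negative


rows = [
    {"qid": "q1", "pos": ["d1"], "neg": ["d8", "d9"]},
    {"qid": "q2", "pos": ["d2", "d3"], "neg": ["d7"]},
]
pairs: List[Tuple[str, str, str]] = list(expand_to_triplets(rows))
```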

XPM Config datamaestro_text.datasets.irds.data.TrainingTriplets(*, irds, id)

Bases: TrainingTriplets, IRDSId

Training triplets from IR Dataset

irds: str: The id to load the dataset from ir_datasets

id: str: The unique (sub-)dataset ID

Transforms

XPM Config datamaestro_text.transforms.ir.StoreTrainingTripletTopicAdapter(*, id, store, data)

Bases: TrainingTriplets

Retrieve an adhoc topic text from a topic store (given the topic ID)

id: str

store: datamaestro_text.data.ir.TopicsStore: The topic store to use

data: datamaestro_text.data.ir.TrainingTriplets: Input data

XPM Config datamaestro_text.transforms.ir.StoreTrainingTripletDocumentAdapter(*, id, store, data)

Bases: TrainingTriplets

Transforms training triplets to add the document text from a document store

id: str

store: datamaestro_text.data.ir.DocumentStore: The document store to use

data: datamaestro_text.data.ir.TrainingTriplets: Input data
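
The adapter idea, replacing document IDs in each triplet by text looked up in a store, can be sketched as a generator. Here the store is a plain dictionary and `attach_document_text` is a hypothetical helper; the actual adapter works on `Record` objects and a `DocumentStore`.

```python
from typing import Dict, Iterable, Iterator, Tuple

Triplet = Tuple[str, str, str]


def attach_document_text(triplets: Iterable[Triplet], store: Dict[str, str]) -> Iterator[Triplet]:
    """Replace document IDs by their text, leaving the query untouched."""
    for query, pos_id, neg_id in triplets:
        yield query, store[pos_id], store[neg_id]


store = {"d1": "BM25 is a ranking function", "d2": "Cats are mammals"}
id_triplets = [("what is bm25", "d1", "d2")]
text_triplets = list(attach_document_text(id_triplets, store))
```

Resolving text lazily, one triplet at a time, lets the triplet file store compact IDs while the text stays in a single shared store.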

XPM Task datamaestro_text.transforms.ir.ShuffledTrainingTripletsLines(*, data, doc_ids, topic_ids, seed, compressed, sample_rate, sample_max)

Bases: Task

Submit type: Any

Shuffle a set of training triplets

data: datamaestro_text.data.ir.TrainingTriplets: Input data

path: pathgenerated: Output path

doc_ids: bool: Whether to use document ids

topic_ids: bool: True if we have query IDs

seed: int: The random seed

compressed: bool = True: Compress the output

sample_rate: float = 1.0: Sampling rate - set to 1 to keep all the samples

sample_max: int = 0: Maximum number of samples

tmp_path: path (generated): Path where temporary files will be stored
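
How `seed`, `sample_rate`, and `sample_max` might interact can be sketched in memory as follows. The semantics are assumptions of this sketch, not taken from the task's implementation: each item is kept with probability `sample_rate`, and `sample_max = 0` means no upper bound. `shuffle_and_sample` is a hypothetical helper.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shuffle_and_sample(
    items: Sequence[T], seed: int, sample_rate: float = 1.0, sample_max: int = 0
) -> List[T]:
    """Subsample, then shuffle deterministically.

    Assumed semantics: each item is kept with probability `sample_rate`,
    and `sample_max == 0` means "no upper bound".
    """
    rng = random.Random(seed)  # private generator: results depend only on `seed`
    kept = [item for item in items if sample_rate >= 1.0 or rng.random() < sample_rate]
    rng.shuffle(kept)
    return kept[:sample_max] if sample_max > 0 else kept


data = list(range(100))
full = shuffle_and_sample(data, seed=42)
capped = shuffle_and_sample(data, seed=42, sample_max=10)
halved = shuffle_and_sample(data, seed=42, sample_rate=0.5)
```

Using a dedicated `random.Random(seed)` instance rather than the global generator keeps the output reproducible even when other code draws random numbers.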