Information Retrieval API

This module provides data types for Information Retrieval datasets and experiments.

The core abstractions are:

  • Documents - Collections of documents to be searched

  • Topics - Queries or information needs

  • Assessments - Relevance judgments (qrels) linking topics to relevant documents

  • Adhoc - A complete IR test collection combining documents, topics, and assessments

For training neural rankers:

  • TrainingTriplets - Training data as (query, positive_doc, negative_doc) triplets

  • PairwiseSampleDataset - General pairwise training data
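How these abstractions fit together can be pictured with plain Python stand-ins. This is only an illustrative sketch: `ToyCollection` and its methods are hypothetical names, not classes from this module, and the real types are experimaestro configurations rather than dataclasses.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyCollection:
    """Hypothetical stand-in for an ad-hoc test collection (not the library's Adhoc)."""

    documents: Dict[str, str]               # document ID -> text
    topics: Dict[str, str]                  # topic ID -> query text
    assessments: Dict[str, Dict[str, int]]  # topic ID -> {document ID: relevance}

    def relevant(self, topic_id: str) -> List[str]:
        """IDs of the documents judged relevant (relevance > 0) for a topic."""
        judged = self.assessments.get(topic_id, {})
        return sorted(doc_id for doc_id, rel in judged.items() if rel > 0)

    def triplets(self) -> List[Tuple[str, str, str]]:
        """(query, positive_doc, negative_doc) text triplets built from the judgments."""
        out = []
        for topic_id, judged in sorted(self.assessments.items()):
            positives = [d for d, rel in sorted(judged.items()) if rel > 0]
            negatives = [d for d, rel in sorted(judged.items()) if rel <= 0]
            for pos in positives:
                for neg in negatives:
                    out.append(
                        (self.topics[topic_id], self.documents[pos], self.documents[neg])
                    )
        return out


collection = ToyCollection(
    documents={
        "d1": "neural ranking models",
        "d2": "cooking recipes",
        "d3": "learning to rank",
    },
    topics={"q1": "ranking with neural networks"},
    assessments={"q1": {"d1": 1, "d2": 0, "d3": 2}},
)
```

Here the judgments play both roles: they identify the relevant documents for evaluation and supply positive/negative pairs for training.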

Data objects

class datamaestro_text.data.ir.base.IDItem(id: str)

Bases: Item, ABC

A topic/document with an external ID

class datamaestro_text.data.ir.base.InternalIDItem(id: int)

Bases: Item, ABC

A topic/document with an internal ID

class datamaestro_text.data.ir.base.ScoredItem(score: float)

Bases: Item

A score associated with the document

score: float

A retrieval score associated with this record (e.g. of the first-stage retriever)

class datamaestro_text.data.ir.base.SimpleTextItem(text: str)

Bases: TextItem

A topic/document with a text record

class datamaestro_text.data.ir.base.TextItem

Bases: Item, ABC

abstract property text: str

Returns the text

class datamaestro_text.data.ir.base.UrlItem(url: str)

Bases: Item

A URL item

datamaestro_text.data.ir.base.create_record(*items: Item, id: str = None, text: str = None) → Record

Convenience function to create a record with a text and/or ID item
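One way to picture a record is as a container holding at most one item per item type. The sketch below mirrors only the documented signature (`*items`, `id`, `text`). `ToyRecord`, `ToyIDItem`, `ToyTextItem` and `toy_create_record` are hypothetical stand-ins, and the lookup-by-type behaviour is an assumption rather than the library's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ToyIDItem:
    id: str


@dataclass(frozen=True)
class ToyTextItem:
    text: str


class ToyRecord:
    """Hypothetical record: stores items keyed by their type."""

    def __init__(self, *items: object):
        self._items: Dict[type, object] = {type(item): item for item in items}

    def __getitem__(self, item_type: type):
        return self._items[item_type]

    def has(self, item_type: type) -> bool:
        return item_type in self._items


def toy_create_record(
    *items: object, id: Optional[str] = None, text: Optional[str] = None
) -> ToyRecord:
    """Builds a record from explicit items plus optional ID and text items."""
    extra = []
    if id is not None:
        extra.append(ToyIDItem(id))
    if text is not None:
        extra.append(ToyTextItem(text))
    return ToyRecord(*items, *extra)


record = toy_create_record(id="doc-1", text="hello world")
```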

Collection

XPM Config datamaestro_text.data.ir.Adhoc(*, id, documents, topics, assessments)

Bases: Base

An Adhoc IR collection with documents, topics and their assessments

id: str

The unique (sub-)dataset ID

documents: datamaestro_text.data.ir.Documents

The set of documents

topics: datamaestro_text.data.ir.Topics

The set of topics

assessments: datamaestro_text.data.ir.AdhocAssessments

The set of assessments (for each topic)

XPM Config datamaestro_text.datasets.irds.data.Adhoc(*, irds, id, documents, topics, assessments)

Bases: Adhoc, IRDSId

irds: str

The id to load the dataset from ir_datasets

id: str

The unique (sub-)dataset ID

documents: datamaestro_text.data.ir.Documents

The set of documents

topics: datamaestro_text.data.ir.Topics

The set of topics

assessments: datamaestro_text.data.ir.AdhocAssessments

The set of assessments (for each topic)

Topics

XPM Config datamaestro_text.data.ir.Topics(*, id)

Bases: Base, ABC

A set of topics with associated IDs

id: str

The unique (sub-)dataset ID

count() → int | None

Returns the number of topics if known

abstract iter() → Iterator[Record]

Returns an iterator over topics

XPM Config datamaestro_text.data.ir.csv.Topics(*, id, path, separator)

Bases: Topics

Pairs of (query ID, query text), delimited by a separator
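A minimal sketch of reading such a file, assuming one pair per line and a tab as the default separator (both are assumptions, and `iter_topic_pairs` is a hypothetical helper rather than part of this module):

```python
from typing import Iterable, Iterator, Tuple


def iter_topic_pairs(
    lines: Iterable[str], separator: str = "\t"
) -> Iterator[Tuple[str, str]]:
    """Yields (query ID, query text) pairs from separator-delimited lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        # Split on the first separator only, so the query text may contain it
        query_id, query = line.split(separator, 1)
        yield query_id, query


sample_lines = [
    "q1\twhat is information retrieval\n",
    "q2\tneural ranking models\n",
]
topic_pairs = list(iter_topic_pairs(sample_lines))
```

The same splitting logic would apply to a file with one document per line in the pid<SEP>text format.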

id: str

The unique (sub-)dataset ID

path: path

separator: str

XPM Config datamaestro_text.data.ir.TopicsStore(*, id)

Bases: Topics

Adhoc topics store

id: str

The unique (sub-)dataset ID

XPM Config datamaestro_text.transforms.ir.TopicWrapper

Bases: Config, ABC

Modify topics on the fly using a topic wrapper

Dataset-specific Topics

XPM Config datamaestro_text.data.ir.trec.TrecTopics(*, id, path, parts)

Bases: Topics

id: str

The unique (sub-)dataset ID

path: path

parts: List[str]

XPM Config datamaestro_text.data.ir.cord19.Topics(*, id, path)

Bases: Topics, File

XML format used in Adhoc topics

id: str

The unique (sub-)dataset ID

path: path

The path of the file

XPM Config datamaestro_text.datasets.irds.data.Topics(*, irds, id)

Bases: TopicsStore, IRDSId

irds: str

The id to load the dataset from ir_datasets

id: str

The unique (sub-)dataset ID

Documents

XPM Config datamaestro_text.data.ir.Documents(*, id, count)

Bases: Base

A set of documents with identifiers

See IR Datasets for the list of document classes

id: str

The unique (sub-)dataset ID

count: int

Number of documents

property documentcount

Returns the number of documents

iter_ids() → Iterator[str]

Iterates over document ids

By default, this falls back on iter_documents, which is inefficient when only the IDs are needed.

XPM Config datamaestro_text.data.ir.csv.Documents(*, id, count, path, separator)

Bases: Documents

One line per document, format pid<SEP>text

id: str

The unique (sub-)dataset ID

count: int

Number of documents

path: path

separator: str

XPM Config datamaestro_text.datasets.irds.data.LZ4DocumentStore(*, id, count, file_access, path, lookup_field)

Bases: DocumentStore, ABC

A LZ4-based document store

id: str

The unique (sub-)dataset ID

count: int

Number of documents

file_access: FileAccess = FileAccess.MMAP

How to access the file collection (might not have any impact, depends on the docstore)

path: path

lookup_field: str

XPM Config datamaestro_text.datasets.irds.data.LZ4JSONLDocumentStore(*, id, count, file_access, path, lookup_field)

Bases: LZ4DocumentStore

json-l based document store

Each line is a JSON object of the form { "id": "...", "text": "..." }
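As an illustration of this line format, the sketch below parses JSON lines and builds an in-memory lookup keyed by a field. It involves no LZ4 compression or on-disk store. `build_lookup` is a hypothetical helper, and using `lookup_field` as the dictionary key is an assumption about what that parameter means.

```python
import json
from typing import Dict, Iterable


def build_lookup(lines: Iterable[str], lookup_field: str = "id") -> Dict[str, dict]:
    """Maps the value of `lookup_field` to the parsed JSON document."""
    lookup: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        doc = json.loads(line)
        lookup[doc[lookup_field]] = doc
    return lookup


jsonl_lines = [
    '{"id": "d1", "text": "first document"}',
    '{"id": "d2", "text": "second document"}',
]
jsonl_store = build_lookup(jsonl_lines)
```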

id: str

The unique (sub-)dataset ID

count: int

Number of documents

file_access: FileAccess = FileAccess.MMAP

How to access the file collection (might not have any impact, depends on the docstore)

path: path

lookup_field: str

IR-Datasets Base

XPM Config datamaestro_text.datasets.irds.data.IRDSId(*, irds)

Bases: Config

irds: str

The id to load the dataset from ir_datasets

Dataset-specific documents

XPM Config datamaestro_text.data.ir.cord19.Documents(*, id, path, delimiter, ignore, names_row, count)

Bases: Documents, Generic

id: str

The unique (sub-)dataset ID

path: path

The path of the file

delimiter: str = ,

ignore: int = 0

names_row: int = -1

count: int

Number of documents

XPM Config datamaestro_text.data.ir.trec.TipsterCollection(*, id, count, path)

Bases: Documents

id: str

The unique (sub-)dataset ID

count: int

Number of documents

path: path

XPM Config datamaestro_text.data.ir.stores.OrConvQADocumentStore(*, id, count, file_access, path)

Bases: LZ4DocumentStore

id: str

The unique (sub-)dataset ID

count: int

Number of documents

file_access: FileAccess = FileAccess.MMAP

How to access the file collection (might not have any impact, depends on the docstore)

path: path

lookup_field: str = id (constant)

index_fields: List[str] = ['id'] (constant)

fields: List[str] = ['id', 'title', 'body', 'aid', 'bid'] (constant)

XPM Config datamaestro_text.data.ir.stores.IKatClueWeb22DocumentStore(*, id, count, file_access, path)

Bases: LZ4DocumentStore

id: str

The unique (sub-)dataset ID

count: int

Number of documents

file_access: FileAccess = FileAccess.MMAP

How to access the file collection (might not have any impact, depends on the docstore)

path: path

lookup_field: str = id (constant)

index_fields: List[str] = ['id'] (constant)

XPM Config datamaestro_text.datasets.irds.data.Documents(*, irds, id, count, file_access)

Bases: DocumentStore, IRDSId

irds: str

The id to load the dataset from ir_datasets

id: str

The unique (sub-)dataset ID

count: int

Number of documents

file_access: FileAccess = FileAccess.MMAP

How to access the file collection (might not have any impact, depends on the docstore)

Assessments

XPM Config datamaestro_text.data.ir.AdhocAssessments(*, id)

Bases: Base, ABC

Ad-hoc assessments (qrels)

id: str

The unique (sub-)dataset ID

iter() → Iterator[AdhocAssessedTopic]

Returns an iterator over assessments

XPM Config datamaestro_text.data.ir.trec.TrecAdhocAssessments(*, id, path)

Bases: AdhocAssessments

id: str

The unique (sub-)dataset ID

path: path

XPM Config datamaestro_text.datasets.irds.data.AdhocAssessments(*, irds, id)

Bases: AdhocAssessments, IRDSId

irds: str

The id to load the dataset from ir_datasets

id: str

The unique (sub-)dataset ID

class datamaestro_text.data.ir.AdhocAssessedTopic(topic_id: str, assessments: List[AdhocAssessment])

Bases: object

class datamaestro_text.data.ir.AdhocAssessment(doc_id: str)

Bases: object
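The per-topic grouping suggested by these two classes can be sketched as follows, using the standard TREC qrels line layout (topic ID, iteration, document ID, relevance). `ToyAssessment`, `ToyAssessedTopic` and `group_qrels` are hypothetical stand-ins. The `rel` field is an assumption, since the documented `AdhocAssessment` signature only shows `doc_id`.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class ToyAssessment:
    doc_id: str
    rel: int  # assumption: not shown in the documented AdhocAssessment signature


@dataclass
class ToyAssessedTopic:
    topic_id: str
    assessments: List[ToyAssessment] = field(default_factory=list)


def group_qrels(lines: Iterable[str]) -> List[ToyAssessedTopic]:
    """Groups TREC qrels lines into one assessed topic per topic ID."""
    by_topic: Dict[str, ToyAssessedTopic] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed or empty lines
        topic_id, _iteration, doc_id, rel = parts
        topic = by_topic.setdefault(topic_id, ToyAssessedTopic(topic_id))
        topic.assessments.append(ToyAssessment(doc_id, int(rel)))
    # dict preserves insertion order, so topics come out in first-seen order
    return list(by_topic.values())


qrels_lines = ["q1 0 d1 1", "q1 0 d2 0", "q2 0 d3 2"]
assessed_topics = group_qrels(qrels_lines)
```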

Runs

XPM Config datamaestro_text.data.ir.AdhocRun(*, id)

Bases: Base

IR adhoc run

id: str

The unique (sub-)dataset ID

XPM Config datamaestro_text.data.ir.csv.AdhocRunWithText(*, id, path, separator)

Bases: AdhocRun

A run that includes text, stored as (qid, doc.id, query, passage) fields

id: str

The unique (sub-)dataset ID

path: path

separator: str

XPM Config datamaestro_text.data.ir.trec.TrecAdhocRun(*, id, path)

Bases: AdhocRun

id: str

The unique (sub-)dataset ID

path: path

XPM Config datamaestro_text.datasets.irds.data.AdhocRun(*, irds, id)

Bases: AdhocRun, IRDSId

irds: str

The id to load the dataset from ir_datasets

id: str

The unique (sub-)dataset ID

Results

XPM Config datamaestro_text.data.ir.AdhocResults(*, id)

Bases: Base

id: str

The unique (sub-)dataset ID

XPM Config datamaestro_text.data.ir.trec.TrecAdhocResults(*, id, metrics, results, detailed)

Bases: AdhocResults

Adhoc results (TREC format)

id: str

The unique (sub-)dataset ID

metrics: List[datamaestro_text.data.ir.Measure]

List of reported metrics

results: path

Main results

detailed: path

Results per topic (if any)

get_results() → Dict[str, float]

Returns the results as a dictionary {metric_name: value}
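To illustrate the shape of the returned dictionary, the sketch below parses summary lines in the usual trec_eval layout (metric, topic, value), keeping only the aggregate rows marked `all`. Whether `get_results` reads exactly this layout is an assumption, and `parse_trec_eval` is a hypothetical helper.

```python
from typing import Dict, Iterable


def parse_trec_eval(lines: Iterable[str]) -> Dict[str, float]:
    """Builds {metric_name: value} from aggregate ("all") trec_eval rows."""
    results: Dict[str, float] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        metric, topic, value = parts
        if topic != "all":
            continue  # per-topic rows belong to the detailed results
        try:
            results[metric] = float(value)
        except ValueError:
            continue  # non-numeric rows such as the run identifier
    return results


trec_eval_output = [
    "runid\tall\tmy-run",
    "map\tall\t0.2500",
    "P_10\tall\t0.4000",
    "map\tq1\t0.5000",
]
parsed_results = parse_trec_eval(trec_eval_output)
```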

Evaluation

XPM Config datamaestro_text.data.ir.Measure

Bases: Config

An Information Retrieval measure

Reranking

XPM Config datamaestro_text.data.ir.RerankAdhoc(*, id, documents, topics, assessments, run)

Bases: Adhoc

Re-ranking ad-hoc task based on an existing run

id: str

The unique (sub-)dataset ID

documents: datamaestro_text.data.ir.Documents

The set of documents

topics: datamaestro_text.data.ir.Topics

The set of topics

assessments: datamaestro_text.data.ir.AdhocAssessments

The set of assessments (for each topic)

run: datamaestro_text.data.ir.AdhocRun

The run to re-rank
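The re-ranking task can be sketched with plain dictionaries: the run supplies the candidate documents per topic, and a scoring function reorders them. Everything here is a hypothetical stand-in (`rerank`, `overlap`, and the dictionary layouts), and the word-overlap scorer is a placeholder for a real ranking model.

```python
from typing import Callable, Dict, List, Tuple

Run = Dict[str, List[Tuple[str, float]]]  # topic ID -> [(document ID, score)]


def rerank(
    run: Run,
    topics: Dict[str, str],
    documents: Dict[str, str],
    scorer: Callable[[str, str], float],
) -> Run:
    """Re-scores every candidate of the run and sorts by decreasing score."""
    reranked: Run = {}
    for topic_id, candidates in run.items():
        query = topics[topic_id]
        rescored = [
            (doc_id, scorer(query, documents[doc_id])) for doc_id, _ in candidates
        ]
        rescored.sort(key=lambda pair: pair[1], reverse=True)
        reranked[topic_id] = rescored
    return reranked


def overlap(query: str, text: str) -> float:
    """Placeholder scorer: number of distinct words shared by query and text."""
    return float(len(set(query.split()) & set(text.split())))


first_stage: Run = {"q1": [("d1", 12.0), ("d2", 9.5)]}
reranked = rerank(
    first_stage,
    topics={"q1": "neural ranking"},
    documents={"d1": "cooking recipes", "d2": "neural ranking models"},
    scorer=overlap,
)
```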

Document Index

XPM Config datamaestro_text.data.ir.DocumentStore(*, id, count, file_access)

Bases: Documents

A document store

A document store can:

  • match external and internal IDs

  • return the document content

  • return the number of documents

id: str

The unique (sub-)dataset ID

count: int

Number of documents

file_access: FileAccess = FileAccess.MMAP

How to access the file collection (might not have any impact, depends on the docstore)

docid_internal2external(docid: int)

Converts an internal collection ID (integer) to an external ID

document_ext(docid: str) → Record

Returns a document given its external ID

document_int(internal_docid: int) → Record

Returns a document given its internal ID

property documentcount

Returns the number of documents

iter_sample(randint: Callable[[int], int] | None) → Iterator[Record]

Sample documents from the dataset
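A minimal in-memory sketch of these operations is given below. `ToyDocumentStore` is a hypothetical stand-in that returns (external ID, text) tuples instead of records. Two assumptions are made: internal IDs are list positions, and `randint(n)` returns an integer in [0, n), with sampling producing an endless stream.

```python
import itertools
import random
from typing import Callable, Iterator, List, Optional, Tuple

Doc = Tuple[str, str]  # (external ID, text)


class ToyDocumentStore:
    def __init__(self, docs: List[Doc]):
        self._docs = docs  # position in the list = internal ID
        self._ext2int = {ext_id: ix for ix, (ext_id, _) in enumerate(docs)}

    @property
    def documentcount(self) -> int:
        return len(self._docs)

    def docid_internal2external(self, docid: int) -> str:
        return self._docs[docid][0]

    def document_int(self, internal_docid: int) -> Doc:
        return self._docs[internal_docid]

    def document_ext(self, docid: str) -> Doc:
        return self._docs[self._ext2int[docid]]

    def iter_sample(
        self, randint: Optional[Callable[[int], int]] = None
    ) -> Iterator[Doc]:
        # Assumption: randint(n) returns an integer in [0, n)
        randint = randint or (lambda n: random.randrange(n))
        while True:  # endless stream; the caller decides when to stop
            yield self.document_int(randint(self.documentcount))


toy_store = ToyDocumentStore([("d1", "alpha"), ("d2", "beta"), ("d3", "gamma")])
rng = random.Random(0)  # seeded generator for reproducible sampling
sampled = list(itertools.islice(toy_store.iter_sample(rng.randrange), 5))
```

Passing the random function in, rather than seeding a global generator, keeps sampling reproducible and lets each caller control its own random state.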

XPM Config datamaestro_text.data.ir.AdhocIndex(*, id, count, file_access)

Bases: DocumentStore

An index can be used to retrieve documents based on terms

id: str

The unique (sub-)dataset ID

count: int

Number of documents

file_access: FileAccess = FileAccess.MMAP

How to access the file collection (might not have any impact, depends on the docstore)

term_df(term: str)

Returns the document frequency

property termcount

Returns the number of terms in the index
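Document frequency is the number of documents that contain a term at least once. The sketch below computes it with whitespace tokenization and lowercasing, both of which are simplifying assumptions. `ToyIndex` is a hypothetical stand-in, not an actual index implementation.

```python
from collections import Counter
from typing import Dict


class ToyIndex:
    def __init__(self, documents: Dict[str, str]):
        self._df: Counter = Counter()
        for text in documents.values():
            # set(): each document counts once per term, however often it occurs
            self._df.update(set(text.lower().split()))

    def term_df(self, term: str) -> int:
        """Number of documents containing the term (0 if it never occurs)."""
        return self._df.get(term, 0)


toy_index = ToyIndex(
    {
        "d1": "neural ranking models neural",
        "d2": "neural networks",
        "d3": "cooking",
    }
)
```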

Training triplets

XPM Config datamaestro_text.data.ir.TrainingTriplets(*, id)

Bases: Base, ABC

Triplet for training IR systems: query / query ID, positive document, negative document

id: str

The unique (sub-)dataset ID

iter() → Iterator[Tuple[Record, Record, Record]]

Returns an iterator over (topic, positive document, negative document) triplets

XPM Config datamaestro_text.data.ir.PairwiseSampleDataset(*, id)

Bases: Base, ABC

Datasets where each record is a query with positive and negative samples

id: str

The unique (sub-)dataset ID

XPM Config datamaestro_text.data.ir.TrainingTripletsLines(*, id, sep, path, doc_ids, topic_ids)

Bases: TrainingTriplets

Training triplets with one line per triple (query texts)
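A sketch of reading such a file follows. It assumes the column order (query, positive document, negative document) and a tab separator. `ToyTriplet` and `iter_triplet_lines` are hypothetical names, and the two flags only record whether the fields should be read as IDs or as raw text.

```python
from typing import Iterable, Iterator, NamedTuple


class ToyTriplet(NamedTuple):
    topic: str
    positive: str
    negative: str
    topic_is_id: bool   # True if `topic` holds a query ID rather than text
    docs_are_ids: bool  # True if the documents are IDs rather than text


def iter_triplet_lines(
    lines: Iterable[str],
    sep: str = "\t",
    doc_ids: bool = False,
    topic_ids: bool = False,
) -> Iterator[ToyTriplet]:
    """Yields one triplet per non-empty line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed column order: query, positive document, negative document
        topic, positive, negative = line.split(sep)
        yield ToyTriplet(topic, positive, negative, topic_ids, doc_ids)


triplet_lines = ["what is IR\td12\td98\n", "neural ranking\td7\td3\n"]
parsed_triplets = list(iter_triplet_lines(triplet_lines, doc_ids=True))
```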

id: str

The unique (sub-)dataset ID

sep: str

path: path
doc_ids: bool

True if we have document IDs

topic_ids: bool

True if we have query IDs

XPM Config datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset(*, id, repo_id, name, data_files, split, streaming, local_path, ids, query_id, pos_id, neg_id)

Bases: HuggingFaceDataset, PairwiseSampleDataset

Triplet for training IR systems: query / query ID, positive document, negative document

id: str

The unique (sub-)dataset ID

repo_id: str

The HuggingFace repository id (e.g. user/dataset).

name: str

HuggingFace dataset name (a.k.a. config).

data_files: str

Specific data files to load.

split: str

Dataset split to load.

streaming: bool = False

When True, load the dataset in streaming mode (no local cache).

local_path: path

If set, load from this local mirror instead of the HuggingFace Hub. This is a meta parameter because the logical dataset is the same regardless of where the bytes come from.

ids: bool

True if the triplet is made of IDs, False otherwise

query_id: str = qid

The name of the field containing the query ID

pos_id: str = pos

The name of the field containing the positive samples

neg_id: str = neg

The name of the field containing the negative samples
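The three field names select where the query, positive and negative samples live in each row. The sketch below works on a plain dictionary standing in for one dataset row. It assumes that the positive and negative fields each hold a list of samples, which the description above does not confirm, and `row_to_triplets` is a hypothetical helper.

```python
from itertools import product
from typing import Dict, List, Tuple


def row_to_triplets(
    row: Dict[str, object],
    query_id: str = "qid",
    pos_id: str = "pos",
    neg_id: str = "neg",
) -> List[Tuple[object, object, object]]:
    """Expands one row into every (query, positive, negative) combination."""
    # Assumption: row[pos_id] and row[neg_id] are lists of samples
    return [(row[query_id], pos, neg) for pos, neg in product(row[pos_id], row[neg_id])]


example_row = {"qid": "q1", "pos": ["d1"], "neg": ["d7", "d9"]}
row_triplets = row_to_triplets(example_row)
```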

XPM Config datamaestro_text.datasets.irds.data.TrainingTriplets(*, irds, id)

Bases: TrainingTriplets, IRDSId

Training triplets from IR Dataset

irds: str

The id to load the dataset from ir_datasets

id: str

The unique (sub-)dataset ID

Transforms

XPM Config datamaestro_text.transforms.ir.StoreTrainingTripletTopicAdapter(*, id, store, data)

Bases: TrainingTriplets

Retrieve an adhoc topic text from a topic store (given the topic ID)

id: str

store: datamaestro_text.data.ir.TopicsStore

The topic store to use

data: datamaestro_text.data.ir.TrainingTriplets

Input data

XPM Config datamaestro_text.transforms.ir.StoreTrainingTripletDocumentAdapter(*, id, store, data)

Bases: TrainingTriplets

Transforms training triplets to add the document text from a document store

id: str

store: datamaestro_text.data.ir.DocumentStore

The document store to use

data: datamaestro_text.data.ir.TrainingTriplets

Input data
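Both adapters follow the same pattern: wrap a stream of ID-based triplets and look up the missing text in a store. A minimal sketch with dictionaries standing in for the stores is shown below. `resolve_triplets` is a hypothetical helper, not the adapters' actual implementation.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

Triplet = Tuple[str, str, str]  # (topic, positive document, negative document)


def resolve_triplets(
    triplets: Iterable[Triplet],
    topic_store: Optional[Dict[str, str]] = None,
    document_store: Optional[Dict[str, str]] = None,
) -> Iterator[Triplet]:
    """Replaces IDs by text for whichever store is provided."""
    for topic, positive, negative in triplets:
        if topic_store is not None:
            topic = topic_store[topic]
        if document_store is not None:
            positive = document_store[positive]
            negative = document_store[negative]
        yield topic, positive, negative


id_triplets = [("q1", "d1", "d2")]
resolved = list(
    resolve_triplets(
        id_triplets,
        topic_store={"q1": "neural ranking"},
        document_store={"d1": "learning to rank", "d2": "cooking recipes"},
    )
)
```

Because the lookup happens lazily while iterating, the triplet file itself can stay ID-only and compact.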

XPM Task datamaestro_text.transforms.ir.ShuffledTrainingTripletsLines(*, data, doc_ids, topic_ids, seed, compressed, sample_rate, sample_max)

Bases: Task

Submit type: Any

Shuffle a set of training triplets

data: datamaestro_text.data.ir.TrainingTriplets

Input data

path: path (generated)

Output path

doc_ids: bool

Whether to use document ids

topic_ids: bool

True if we have query IDs

seed: int

The random seed

compressed: bool = True

Compress the output

sample_rate: float = 1.0

Sampling rate - set to 1 to keep all the samples

sample_max: int = 0

Maximum number of samples

tmp_path: path (generated)

Path where temporary files will be stored
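How `seed`, `sample_rate` and `sample_max` might interact can be sketched in memory as below. This is not the task's implementation: the order of operations (sample, shuffle, then truncate) and the reading of `sample_max = 0` as "no limit" are assumptions, and `shuffled_sample` is a hypothetical helper. A real task over large files would presumably rely on temporary files rather than a list.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shuffled_sample(
    triplets: Sequence[T],
    seed: int,
    sample_rate: float = 1.0,
    sample_max: int = 0,
) -> List[T]:
    """Samples, shuffles and optionally truncates a sequence, reproducibly."""
    rng = random.Random(seed)  # local generator: same seed, same output
    # Keep everything when the rate is 1, otherwise keep each item with
    # probability sample_rate
    kept = [t for t in triplets if sample_rate >= 1.0 or rng.random() < sample_rate]
    rng.shuffle(kept)
    if sample_max > 0:  # assumption: 0 means "no limit"
        kept = kept[:sample_max]
    return kept


toy_triplets = [("q%d" % i, "pos%d" % i, "neg%d" % i) for i in range(10)]
shuffled = shuffled_sample(toy_triplets, seed=42)
```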