Information Retrieval API
This module provides data types for Information Retrieval datasets and experiments.
The core abstractions are:
- Documents: collections of documents to be searched
- Topics: queries or information needs
- Assessments: relevance judgments (qrels) linking topics to relevant documents
- Adhoc: a complete IR test collection combining documents, topics, and assessments
For training neural rankers:
- TrainingTriplets: training data as (query, positive_doc, negative_doc) triplets
- PairwiseSampleDataset: general pairwise training data
Data objects
- class datamaestro_text.data.ir.base.IDItem(id: str)
Bases: Item, ABC
A topic/document with an external ID
- class datamaestro_text.data.ir.base.InternalIDItem(id: int)
Bases: Item, ABC
A topic/document with an internal ID
- class datamaestro_text.data.ir.base.ScoredItem(score: float)
Bases: Item
A score associated with the document
- score: float
A retrieval score associated with this record (e.g. of the first-stage retriever)
- class datamaestro_text.data.ir.base.SimpleTextItem(text: str)
Bases: TextItem
A topic/document with a text record
Collection
- XPM Config datamaestro_text.data.ir.Adhoc(*, id, documents, topics, assessments)
Bases: Base
An Adhoc IR collection with documents, topics and their assessments
- id: str
The unique (sub-)dataset ID
- documents: datamaestro_text.data.ir.Documents
The set of documents
- topics: datamaestro_text.data.ir.Topics
The set of topics
- assessments: datamaestro_text.data.ir.AdhocAssessments
The set of assessments (for each topic)
- XPM Config datamaestro_text.datasets.irds.data.Adhoc(*, irds, id, documents, topics, assessments)
- irds: str
The id to load the dataset from ir_datasets
- id: str
The unique (sub-)dataset ID
- documents: datamaestro_text.data.ir.Documents
The set of documents
- topics: datamaestro_text.data.ir.Topics
The set of topics
- assessments: datamaestro_text.data.ir.AdhocAssessments
The set of assessments (for each topic)
Topics
- XPM Config datamaestro_text.data.ir.Topics(*, id)
Bases: Base, ABC
A set of topics with associated IDs
- id: str
The unique (sub-)dataset ID
- count() -> int | None
Returns the number of topics if known
- XPM Config datamaestro_text.data.ir.csv.Topics(*, id, path, separator)
Bases: Topics
Pairs of query id - query using a separator
- id: str
The unique (sub-)dataset ID
- path: path
- separator: str
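As a rough illustration of the "query id, separator, query" layout described above, the sketch below parses such lines with the standard library only. The function name read_topics and the sample data are hypothetical and are not part of datamaestro_text; the tab separator is an assumption.

```python
import io
from typing import Dict, TextIO


def read_topics(stream: TextIO, separator: str = "\t") -> Dict[str, str]:
    """Parse lines of the form <query id><separator><query text>.

    Hypothetical helper for illustration, not the library's reader.
    """
    topics: Dict[str, str] = {}
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first separator only, so the query text may contain it
        query_id, text = line.split(separator, 1)
        topics[query_id] = text
    return topics


sample = io.StringIO("q1\twhat is information retrieval\nq2\tneural ranking models\n")
topics = read_topics(sample)
```

Splitting on the first separator only is a design choice of this sketch: it keeps the query text intact even if it happens to contain the separator character.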
- XPM Config datamaestro_text.data.ir.TopicsStore(*, id)
Bases: Topics
Adhoc topics store
- id: str
The unique (sub-)dataset ID
- XPM Config datamaestro_text.transforms.ir.TopicWrapper
Bases: Config, ABC
Modify topics on the fly using a topic wrapper
Dataset-specific Topics
- XPM Config datamaestro_text.data.ir.trec.TrecTopics(*, id, path, parts)
Bases: Topics
- id: str
The unique (sub-)dataset ID
- path: path
- parts: List[str]
- XPM Config datamaestro_text.data.ir.cord19.Topics(*, id, path)
XML format used in Adhoc topics
- id: str
The unique (sub-)dataset ID
- path: path
The path of the file
- XPM Config datamaestro_text.datasets.irds.data.Topics(*, irds, id)
Bases: TopicsStore, IRDSId
- irds: str
The id to load the dataset from ir_datasets
- id: str
The unique (sub-)dataset ID
Documents
- XPM Config datamaestro_text.data.ir.Documents(*, id, count)
Bases: Base
A set of documents with identifiers
See IR Datasets for the list of query classes
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- property documentcount
Returns the number of documents
- iter_ids() -> Iterator[str]
Iterates over document ids
By default, this uses iter_documents, which is not very efficient.
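The note about efficiency can be pictured with a small stand-in: a default iter_ids that walks over full document records only to read their IDs. The classes Doc and ToyDocuments below are hypothetical and only mimic the behaviour described above; they are not the datamaestro_text implementation.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Doc:
    # Hypothetical stand-in for a document record
    id: str
    text: str


class ToyDocuments:
    """Hypothetical stand-in, not the datamaestro_text class."""

    def __init__(self, docs: List[Doc]):
        self._docs = docs

    def iter_documents(self) -> Iterator[Doc]:
        # Yields full records (ID and text)
        yield from self._docs

    def iter_ids(self) -> Iterator[str]:
        # Default behaviour: go through each full record just to read its ID.
        # A subclass with a separate ID index could override this.
        for doc in self.iter_documents():
            yield doc.id


docs = ToyDocuments([Doc("d1", "first text"), Doc("d2", "second text")])
ids = list(docs.iter_ids())
```

With a large collection, loading every text to obtain IDs is wasteful, which is presumably why the documentation flags the default as inefficient.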
- XPM Config datamaestro_text.data.ir.csv.Documents(*, id, count, path, separator)
Bases: Documents
One line per document, format pid<SEP>text
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- path: path
- separator: str
- XPM Config datamaestro_text.datasets.irds.data.LZ4DocumentStore(*, id, count, file_access, path, lookup_field)
Bases: DocumentStore, ABC
A LZ4-based document store
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
- lookup_field: str
- XPM Config datamaestro_text.datasets.irds.data.LZ4JSONLDocumentStore(*, id, count, file_access, path, lookup_field)
Bases: LZ4DocumentStore
JSON Lines (JSONL) based document store
Each line is of the form
`{ "id": "...", "text": "..." }`
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
- lookup_field: str
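To make the line format and the role of lookup_field more concrete, here is a minimal sketch that indexes JSON Lines records by one field. It uses only the standard library and leaves out the LZ4 compression layer entirely; build_lookup and the sample records are hypothetical, not the store's actual code.

```python
import io
import json
from typing import Dict, TextIO


def build_lookup(stream: TextIO, lookup_field: str = "id") -> Dict[str, dict]:
    """Index JSON Lines records by the value of `lookup_field`.

    Hypothetical helper; the real store is assumed to do this on disk.
    """
    lookup: Dict[str, dict] = {}
    for line in stream:
        if line.strip():  # ignore empty lines
            record = json.loads(line)
            lookup[record[lookup_field]] = record
    return lookup


lines = io.StringIO(
    '{"id": "d1", "text": "first document"}\n'
    '{"id": "d2", "text": "second document"}\n'
)
store = build_lookup(lines)
```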
IR-Datasets Base
- XPM Config datamaestro_text.datasets.irds.data.IRDSId(*, irds)
Bases: Config
- irds: str
The id to load the dataset from ir_datasets
Dataset-specific documents
- XPM Config datamaestro_text.data.ir.cord19.Documents(*, id, path, delimiter, ignore, names_row, count)
- id: str
The unique (sub-)dataset ID
- path: path
The path of the file
- delimiter: str = ,
- ignore: int = 0
- names_row: int = -1
- count: int
Number of documents
- XPM Config datamaestro_text.data.ir.trec.TipsterCollection(*, id, count, path)
Bases: Documents
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- path: path
- XPM Config datamaestro_text.data.ir.stores.OrConvQADocumentStore(*, id, count, file_access, path)
Bases: LZ4DocumentStore
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
- lookup_field: str = id (constant)
- index_fields: List[str] = ['id'] (constant)
- fields: List[str] = ['id', 'title', 'body', 'aid', 'bid'] (constant)
- XPM Config datamaestro_text.data.ir.stores.IKatClueWeb22DocumentStore(*, id, count, file_access, path)
Bases: LZ4DocumentStore
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- path: path
- lookup_field: str = id (constant)
- index_fields: List[str] = ['id'] (constant)
- XPM Config datamaestro_text.datasets.irds.data.Documents(*, irds, id, count, file_access)
Bases: DocumentStore, IRDSId
- irds: str
The id to load the dataset from ir_datasets
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
Assessments
- XPM Config datamaestro_text.data.ir.AdhocAssessments(*, id)
Bases: Base, ABC
Ad-hoc assessments (qrels)
- id: str
The unique (sub-)dataset ID
- iter() -> Iterator[AdhocAssessedTopic]
Returns an iterator over assessments
- XPM Config datamaestro_text.data.ir.trec.TrecAdhocAssessments(*, id, path)
Bases: AdhocAssessments
- id: str
The unique (sub-)dataset ID
- path: path
- XPM Config datamaestro_text.datasets.irds.data.AdhocAssessments(*, irds, id)
Bases: AdhocAssessments, IRDSId
- irds: str
The id to load the dataset from ir_datasets
- id: str
The unique (sub-)dataset ID
- class datamaestro_text.data.ir.AdhocAssessedTopic(topic_id: str, assessments: List[AdhocAssessment])
Bases: object
- class datamaestro_text.data.ir.AdhocAssessment(doc_id: str)
Bases: object
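The relationship between the two classes above (one assessed topic holding a list of per-document assessments) can be sketched by grouping TREC-style qrels lines by topic. The dataclasses below are stand-ins with the same field names; the rel field and the function iter_assessed_topics are hypothetical additions for illustration, and the four-column input layout (topic, iteration, document, relevance) is the usual TREC qrels convention rather than something stated in this page.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Assessment:
    # Stand-in for AdhocAssessment; `rel` is a hypothetical extra field
    doc_id: str
    rel: int


@dataclass
class AssessedTopic:
    # Stand-in for AdhocAssessedTopic
    topic_id: str
    assessments: List[Assessment] = field(default_factory=list)


def iter_assessed_topics(qrels_lines: List[str]) -> Iterator[AssessedTopic]:
    """Group qrels lines (topic, iteration, doc, relevance) by topic."""
    by_topic: Dict[str, AssessedTopic] = {}
    for line in qrels_lines:
        topic_id, _iteration, doc_id, rel = line.split()
        if topic_id not in by_topic:
            by_topic[topic_id] = AssessedTopic(topic_id)
        by_topic[topic_id].assessments.append(Assessment(doc_id, int(rel)))
    # dicts preserve insertion order, so topics come out in first-seen order
    yield from by_topic.values()


qrels = ["301 0 d1 1", "301 0 d7 0", "302 0 d3 2"]
assessed = list(iter_assessed_topics(qrels))
```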
Runs
- XPM Config datamaestro_text.data.ir.AdhocRun(*, id)
Bases: Base
IR adhoc run
- id: str
The unique (sub-)dataset ID
- XPM Config datamaestro_text.data.ir.csv.AdhocRunWithText(*, id, path, separator)
Bases: AdhocRun
(qid, doc.id, query, passage)
- id: str
The unique (sub-)dataset ID
- path: path
- separator: str
Results
- XPM Config datamaestro_text.data.ir.AdhocResults(*, id)
Bases: Base
- id: str
The unique (sub-)dataset ID
- XPM Config datamaestro_text.data.ir.trec.TrecAdhocResults(*, id, metrics, results, detailed)
Bases: AdhocResults
Adhoc results (TREC format)
- id: str
The unique (sub-)dataset ID
- metrics: List[datamaestro_text.data.ir.Measure]
List of reported metrics
- results: path
Main results
- detailed: path
Results per topic (if any)
- get_results() -> Dict[str, float]
Returns the results as a dictionary {metric_name: value}
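A dictionary of the shape {metric_name: value} could be built along the following lines, assuming the results file uses the three-column "measure, topic, value" layout printed by trec_eval, with summary rows marked by the topic "all". That file layout is an assumption, and parse_results is a hypothetical helper, not the method's actual implementation.

```python
from typing import Dict, List


def parse_results(lines: List[str]) -> Dict[str, float]:
    """Collect summary rows of the form '<measure> all <value>'.

    Hypothetical helper; the input layout is an assumption.
    """
    results: Dict[str, float] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # not a measure row
        metric, topic, value = parts
        if topic != "all":
            continue  # per-topic rows are ignored here
        try:
            results[metric] = float(value)
        except ValueError:
            pass  # non-numeric rows such as a run identifier
    return results


output = ["runid all my_run", "map all 0.2500", "P_10 all 0.4000", "map 301 0.1000"]
results = parse_results(output)
```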
Evaluation
- XPM Config datamaestro_text.data.ir.Measure
Bases: Config
An Information Retrieval measure
Reranking
- XPM Config datamaestro_text.data.ir.RerankAdhoc(*, id, documents, topics, assessments, run)
Bases: Adhoc
Re-ranking ad-hoc task based on an existing run
- id: str
The unique (sub-)dataset ID
- documents: datamaestro_text.data.ir.Documents
The set of documents
- topics: datamaestro_text.data.ir.Topics
The set of topics
- assessments: datamaestro_text.data.ir.AdhocAssessments
The set of assessments (for each topic)
- run: datamaestro_text.data.ir.AdhocRun
The run to re-rank
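Conceptually, a re-ranking task takes the candidates listed in an existing run and re-orders them with a new scorer. The sketch below shows that idea with plain dictionaries standing in for topics, documents and the run; rerank and overlap_scorer are hypothetical names, and the word-overlap scorer is only a toy placeholder for a real ranking model.

```python
from typing import Callable, Dict, List, Tuple


def rerank(
    run: Dict[str, List[str]],
    topics: Dict[str, str],
    documents: Dict[str, str],
    scorer: Callable[[str, str], float],
) -> Dict[str, List[Tuple[str, float]]]:
    """Re-score each candidate of the run and sort by decreasing score."""
    reranked: Dict[str, List[Tuple[str, float]]] = {}
    for topic_id, doc_ids in run.items():
        query = topics[topic_id]
        scored = [(doc_id, scorer(query, documents[doc_id])) for doc_id in doc_ids]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        reranked[topic_id] = scored
    return reranked


def overlap_scorer(query: str, text: str) -> float:
    # Toy scorer: number of distinct query terms that appear in the document
    return float(len(set(query.split()) & set(text.split())))


topics = {"q1": "neural ranking"}
documents = {"d1": "classic boolean retrieval", "d2": "neural ranking with transformers"}
run = {"q1": ["d1", "d2"]}  # candidate order from a first-stage retriever
reranked = rerank(run, topics, documents, overlap_scorer)
```

Only documents already present in the run are scored, which is what distinguishes re-ranking from retrieval over the whole collection.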
Document Index
- XPM Config datamaestro_text.data.ir.DocumentStore(*, id, count, file_access)
Bases: Documents
A document store
A document store can match external/internal IDs, return the document content, and return the number of documents.
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- docid_internal2external(docid: int)
Converts an internal collection ID (integer) to an external ID
- property documentcount
Returns the number of documents
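The three capabilities listed for a document store (ID matching, content access, document count) can be mimicked with an in-memory stand-in, where the internal ID is simply the position of a document in a list. ToyDocumentStore is hypothetical, and document_text is an invented accessor name used only for this sketch.

```python
from typing import Dict, List


class ToyDocumentStore:
    """Hypothetical in-memory stand-in, not the datamaestro_text class."""

    def __init__(self, external_ids: List[str], texts: List[str]):
        self._external_ids = external_ids
        self._texts = texts
        # Reverse map: external ID -> internal position
        self._internal: Dict[str, int] = {
            ext: pos for pos, ext in enumerate(external_ids)
        }

    def docid_internal2external(self, docid: int) -> str:
        # Internal IDs are list positions in this sketch
        return self._external_ids[docid]

    def document_text(self, external_id: str) -> str:
        # Hypothetical accessor name
        return self._texts[self._internal[external_id]]

    @property
    def documentcount(self) -> int:
        return len(self._external_ids)


store = ToyDocumentStore(["doc-a", "doc-b"], ["first", "second"])
```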
- XPM Config datamaestro_text.data.ir.AdhocIndex(*, id, count, file_access)
Bases: DocumentStore
An index can be used to retrieve documents based on terms
- id: str
The unique (sub-)dataset ID
- count: int
Number of documents
- file_access: FileAccess = FileAccess.MMAP
How to access the file collection (might not have any impact, depends on the docstore)
- term_df(term: str)
Returns the document frequency
- property termcount
Returns the number of terms in the index
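For readers unfamiliar with the two statistics, the following toy inverted index shows what a document frequency and a term count are: the number of documents containing a term, and the number of distinct terms. ToyIndex and its whitespace tokenisation are assumptions made for the example, not a description of how any concrete AdhocIndex is built.

```python
from collections import defaultdict
from typing import Dict, Set


class ToyIndex:
    """Hypothetical inverted index for illustration only."""

    def __init__(self, documents: Dict[str, str]):
        # term -> set of IDs of the documents that contain it
        self._postings: Dict[str, Set[str]] = defaultdict(set)
        for doc_id, text in documents.items():
            for term in text.lower().split():
                self._postings[term].add(doc_id)

    def term_df(self, term: str) -> int:
        # .get avoids inserting unseen terms into the defaultdict
        return len(self._postings.get(term, ()))

    @property
    def termcount(self) -> int:
        return len(self._postings)


index = ToyIndex({"d1": "neural ranking", "d2": "neural retrieval models"})
```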
Training triplets
- XPM Config datamaestro_text.data.ir.TrainingTriplets(*, id)
Bases: Base, ABC
Triplet for training IR systems: query / query ID, positive document, negative document
- id: str
The unique (sub-)dataset ID
- XPM Config datamaestro_text.data.ir.PairwiseSampleDataset(*, id)
Bases: Base, ABC
Datasets where each record is a query with positive and negative samples
- id: str
The unique (sub-)dataset ID
- XPM Config datamaestro_text.data.ir.TrainingTripletsLines(*, id, sep, path, doc_ids, topic_ids)
Bases: TrainingTriplets
Training triplets with one line per triple (query texts)
- id: str
The unique (sub-)dataset ID
- sep: str
- path: path
- doc_ids: bool
True if we have document IDs
- topic_ids: bool
True if we have query IDs
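A line-per-triple file can be read roughly as follows, assuming each line holds exactly three fields in the order query, positive document, negative document. That field order, the tab separator, and the names Triplet and parse_triplet_line are assumptions of this sketch; whether a field carries text or an identifier is what the doc_ids and topic_ids flags describe.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    query: str      # query text, or a query ID when topic_ids is True
    positive: str   # document text, or a document ID when doc_ids is True
    negative: str   # same convention as `positive`


def parse_triplet_line(line: str, sep: str = "\t") -> Triplet:
    """Split one line into its three fields (hypothetical helper)."""
    query, positive, negative = line.rstrip("\n").split(sep)
    return Triplet(query, positive, negative)


# Query given as text, documents given as IDs (doc_ids=True, topic_ids=False)
triplet = parse_triplet_line("what is bm25\td12\td98\n")
```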
- XPM Config datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset(*, id, repo_id, name, data_files, split, streaming, local_path, ids, query_id, pos_id, neg_id)
Bases: HuggingFaceDataset, PairwiseSampleDataset
Triplet for training IR systems: query / query ID, positive document, negative document
- id: str
The unique (sub-)dataset ID
- repo_id: str
The HuggingFace repository id (e.g. user/dataset).
- name: str
HuggingFace dataset name (a.k.a. config).
- data_files: str
Specific data files to load.
- split: str
Dataset split to load.
- streaming: bool = False
When True, load the dataset in streaming mode, without a local cache.
- local_path: path
If set, load from this local mirror instead of the HuggingFace Hub.
This is a meta parameter because the logical dataset is the same regardless of where the bytes come from.
- ids: bool
True if the triplet is made of IDs, False otherwise
- query_id: str = qid
The name of the field containing the query ID
- pos_id: str = pos
The name of the field containing the positive samples
- neg_id: str = neg
The name of the field containing the negative samples
- XPM Config datamaestro_text.datasets.irds.data.TrainingTriplets(*, irds, id)
Bases: TrainingTriplets, IRDSId
Training triplets from IR Dataset
- irds: str
The id to load the dataset from ir_datasets
- id: str
The unique (sub-)dataset ID
Transforms
- XPM Config datamaestro_text.transforms.ir.StoreTrainingTripletTopicAdapter(*, id, store, data)
Bases: TrainingTriplets
Retrieve an adhoc topic text from a topic store (given the topic ID)
- id: str
- store: datamaestro_text.data.ir.TopicsStore
The topic store to use
- data: datamaestro_text.data.ir.TrainingTriplets
Input data
- XPM Config datamaestro_text.transforms.ir.StoreTrainingTripletDocumentAdapter(*, id, store, data)
Bases: TrainingTriplets
Transforms training triplets to add the document text from a document store
- id: str
- store: datamaestro_text.data.ir.DocumentStore
The document store to use
- data: datamaestro_text.data.ir.TrainingTriplets
Input data
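The effect of such an adapter can be pictured as a lazy mapping over ID-based triplets that swaps each document ID for the text found in a store. In the sketch below a plain dictionary plays the role of the document store, and add_document_text is a hypothetical function; the actual adapter is a configuration object and its internals are not shown in this page.

```python
from typing import Dict, Iterable, Iterator, Tuple

# (query, positive, negative); the last two hold IDs before, texts after
Triplet = Tuple[str, str, str]


def add_document_text(
    triplets: Iterable[Triplet], store: Dict[str, str]
) -> Iterator[Triplet]:
    """Replace document IDs by their text, one triplet at a time."""
    for query, pos_id, neg_id in triplets:
        # Lookup happens lazily, so the input can be a large stream
        yield query, store[pos_id], store[neg_id]


store = {"d1": "bm25 is a ranking function", "d2": "a recipe for bread"}
id_triplets = [("what is bm25", "d1", "d2")]
text_triplets = list(add_document_text(id_triplets, store))
```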
- XPM Task datamaestro_text.transforms.ir.ShuffledTrainingTripletsLines(*, data, doc_ids, topic_ids, seed, compressed, sample_rate, sample_max)
Bases: Task
Submit type: Any
Shuffle a set of training triplets
- data: datamaestro_text.data.ir.TrainingTriplets
Input data
- path: path (generated)
Output path
- doc_ids: bool
Whether to use document ids
- topic_ids: bool
True if we have query IDs
- seed: int
The random seed
- compressed: bool = True
Compress the output
- sample_rate: float = 1.0
Sampling rate - set to 1 to keep all the samples
- sample_max: int = 0
Maximum number of samples
- tmp_path: path (generated)
Path where temporary files will be stored
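How seed, sample_rate and sample_max might interact is sketched below for an in-memory list of lines. The order of operations (sample, then shuffle, then truncate) and the reading of sample_max = 0 as "no limit" are assumptions of this example; shuffle_triplet_lines is a hypothetical function, and the actual task works on files and may proceed differently.

```python
import random
from typing import List


def shuffle_triplet_lines(
    lines: List[str], seed: int, sample_rate: float = 1.0, sample_max: int = 0
) -> List[str]:
    """Seeded shuffle with optional sub-sampling (illustrative sketch)."""
    rng = random.Random(seed)  # local generator: same seed, same output
    if sample_rate < 1.0:
        # Keep each line independently with probability sample_rate
        lines = [line for line in lines if rng.random() < sample_rate]
    shuffled = list(lines)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    if sample_max > 0:
        # Assumption: 0 means "no limit"
        shuffled = shuffled[:sample_max]
    return shuffled


lines = [f"q{i}\tpos{i}\tneg{i}" for i in range(10)]
first = shuffle_triplet_lines(lines, seed=42)
second = shuffle_triplet_lines(lines, seed=42)
capped = shuffle_triplet_lines(lines, seed=42, sample_max=3)
```

Using a dedicated random.Random instance rather than the module-level functions keeps the result reproducible for a given seed, independently of other code that draws random numbers.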