Information Retrieval API
Data objects
- class datamaestro_text.data.ir.base.BaseRecord(*items: Dict[Type[T], T] | T, no_check=False)
Bases: Record
- class datamaestro_text.data.ir.base.DocumentRecord(*items: Dict[Type[T], T] | T, no_check=False)
Bases: BaseRecord
Document record
- class datamaestro_text.data.ir.base.GenericDocumentRecord(*items: Dict[Type[T], T] | T, no_check=False)
Bases: DocumentRecord
- itemtypes: ClassVar[List[Type[T]]] = [<class 'datamaestro_text.data.ir.base.IDItem'>, <class 'datamaestro_text.data.ir.base.TextItem'>]
For specific records, this is the list of types
- class datamaestro_text.data.ir.base.GenericTopicRecord(*items: Dict[Type[T], T] | T, no_check=False)
Bases: TopicRecord
- itemtypes: ClassVar[List[Type[T]]] = [<class 'datamaestro_text.data.ir.base.IDItem'>, <class 'datamaestro_text.data.ir.base.TextItem'>]
For specific records, this is the list of types
- class datamaestro_text.data.ir.base.IDDocumentRecord(*items: Dict[Type[T], T] | T, no_check=False)
Bases: DocumentRecord
- itemtypes: ClassVar[List[Type[T]]] = [<class 'datamaestro_text.data.ir.base.IDItem'>]
For specific records, this is the list of types
- class datamaestro_text.data.ir.base.IDItem(id: str)
Bases: Item, ABC
A topic/document with an external ID
- class datamaestro_text.data.ir.base.IDTopicRecord(*items: Dict[Type[T], T] | T, no_check=False)
Bases: TopicRecord
- itemtypes: ClassVar[List[Type[T]]] = [<class 'datamaestro_text.data.ir.base.IDItem'>]
For specific records, this is the list of types
- class datamaestro_text.data.ir.base.InternalIDItem(id: int)
Bases: Item, ABC
A topic/document with an internal ID
- class datamaestro_text.data.ir.base.ScoredItem(score: float)
Bases: Item
A score associated with the document
- score: float
A retrieval score associated with this record (e.g. of the first-stage retriever)
- class datamaestro_text.data.ir.base.SimpleTextDocumentRecord(*items: Dict[Type[T], T] | T, no_check=False)
Bases: DocumentRecord
- itemtypes: ClassVar[List[Type[T]]] = [<class 'datamaestro_text.data.ir.base.SimpleTextItem'>]
For specific records, this is the list of types
- class datamaestro_text.data.ir.base.SimpleTextItem(text: str)
Bases: TextItem
A topic/document with a text record
- class datamaestro_text.data.ir.base.SimpleTextTopicRecord(*items: Dict[Type[T], T] | T, no_check=False)
Bases: TopicRecord
- itemtypes: ClassVar[List[Type[T]]] = [<class 'datamaestro_text.data.ir.base.SimpleTextItem'>]
For specific records, this is the list of types
- class datamaestro_text.data.ir.base.TextItem
Bases: Item, ABC
- abstract property text: str
Returns the text
- class datamaestro_text.data.ir.base.TopicRecord(*items: Dict[Type[T], T] | T, no_check=False)
Bases: BaseRecord
Topic record
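The records above are built from typed items, and `itemtypes` lists the item types a specific record carries. The following is a minimal, self-contained sketch of that pattern, not the library's implementation: `SketchRecord`, `SketchGenericDocumentRecord` and the two stand-in item classes are hypothetical names, and the lookup by exact item type is an assumption.

```python
from dataclasses import dataclass
from typing import Any, ClassVar, Dict, List, Type


@dataclass
class IDItem:
    """Stand-in for an item holding an external ID (hypothetical)."""
    id: str


@dataclass
class SimpleTextItem:
    """Stand-in for an item holding raw text (hypothetical)."""
    text: str


class SketchRecord:
    """Toy record: at most one item per item type, looked up by type."""

    # Item types a concrete record is expected to hold (empty: no constraint)
    itemtypes: ClassVar[List[Type]] = []

    def __init__(self, *items: Any, no_check: bool = False):
        # Index each item by its exact type
        self.items: Dict[Type, Any] = {type(item): item for item in items}
        if not no_check:
            missing = [t for t in self.itemtypes if t not in self.items]
            if missing:
                raise KeyError(f"Missing item types: {missing}")

    def __getitem__(self, key: Type) -> Any:
        return self.items[key]


class SketchGenericDocumentRecord(SketchRecord):
    # Mirrors the idea of a record with an ID and a text
    itemtypes = [IDItem, SimpleTextItem]


doc = SketchGenericDocumentRecord(IDItem("D1"), SimpleTextItem("hello world"))
print(doc[IDItem].id, doc[SimpleTextItem].text)
```

Keying items by type lets code that only needs an ID, or only a text, accept any record that carries the corresponding item.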
Collection
- XPM Config datamaestro_text.data.ir.Adhoc(*, id, documents, topics, assessments)
Bases: Base
Submit type: datamaestro_text.data.ir.Adhoc
An Adhoc IR collection with documents, topics and their assessments
- id: str
The unique dataset ID
- documents: datamaestro_text.data.ir.Documents
The set of documents
- topics: datamaestro_text.data.ir.Topics
The set of topics
- assessments: datamaestro_text.data.ir.AdhocAssessments
The set of assessments (for each topic)
Topics
- XPM Config datamaestro_text.data.ir.Topics(*, id)
Bases: Base, ABC
Submit type: datamaestro_text.data.ir.Topics
A set of topics with associated IDs
- id: str
The unique dataset ID
- count() int | None
Returns the number of topics if known
- abstract iter() Iterator[TopicRecord]
Returns an iterator over topics
- XPM Config datamaestro_text.data.ir.csv.Topics(*, id, separator, path)
Bases: Topics
Submit type: datamaestro_text.data.ir.csv.Topics
Pairs of (query ID, query text), with the two fields split by a separator
- id: str
The unique dataset ID
- separator: str
- path: Path
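As an illustration of the "query ID, separator, query" layout, here is a minimal sketch of a reader for such a file. It is not the library's parser: `iter_topics` is a hypothetical helper, and the tab default and the one-topic-per-line layout are assumptions.

```python
import io
from typing import Iterable, Iterator, Tuple


def iter_topics(lines: Iterable[str], separator: str = "\t") -> Iterator[Tuple[str, str]]:
    """Yield (query_id, query_text) pairs from separator-delimited lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first separator only, so the query may contain it
        query_id, query_text = line.split(separator, 1)
        yield query_id, query_text


sample = io.StringIO("q1\twhat is information retrieval\nq2\tbm25 ranking\n")
topics = list(iter_topics(sample))
print(topics)
```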
- XPM Config datamaestro_text.transforms.ir.TopicWrapper
Bases: Config, ABC
Submit type: datamaestro_text.transforms.ir.TopicWrapper
Modify topics on the fly using a topic wrapper
Documents
- XPM Config datamaestro_text.data.ir.Documents(*, id, count)
Bases: Base
Submit type: datamaestro_text.data.ir.Documents
A set of documents with identifiers
See IR Datasets for the list of query classes
- id: str
The unique dataset ID
- count: int
Number of documents
- property documentcount
Returns the number of documents
- iter_ids() Iterator[str]
Iterates over document ids
By default, this relies on iter_documents, which is inefficient when only the IDs are needed.
- XPM Config datamaestro_text.data.ir.cord19.Documents(*, id, path, delimiter, ignore, names_row, count)
Submit type: datamaestro_text.data.ir.cord19.Documents
- id: str
The unique dataset ID
- path: Path
The path of the file
- delimiter: str = ","
- ignore: int = 0
- names_row: int = -1
- count: int
Number of documents
Assessments
- XPM Config datamaestro_text.data.ir.AdhocAssessments(*, id)
Bases: Base, ABC
Submit type: datamaestro_text.data.ir.AdhocAssessments
Ad-hoc assessments (qrels)
- id: str
The unique dataset ID
- iter() Iterator[AdhocAssessment]
Returns an iterator over assessments
- XPM Config datamaestro_text.data.ir.trec.TrecAdhocAssessments
Bases: AdhocAssessments
Submit type: datamaestro_text.data.ir.trec.TrecAdhocAssessments
- id: str
The unique dataset ID
- class datamaestro_text.data.ir.AdhocAssessment(doc_id: str)
Bases: object
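For context, TREC-style qrels files hold one judgment per line with four whitespace-separated fields: topic ID, iteration, document ID and relevance. The sketch below reads that layout into a nested dictionary. `read_qrels` is a hypothetical helper and is independent of how TrecAdhocAssessments is implemented.

```python
from typing import Dict, Iterable


def read_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map topic ID -> {document ID: relevance} from TREC qrels lines."""
    qrels: Dict[str, Dict[str, int]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic_id, _iteration, doc_id, relevance = parts
        qrels.setdefault(topic_id, {})[doc_id] = int(relevance)
    return qrels


sample = [
    "301 0 FBIS3-10082 1",
    "301 0 FBIS3-10169 0",
    "302 0 FBIS3-10243 1",
]
qrels = read_qrels(sample)
print(qrels["301"])
```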
Runs
- XPM Config datamaestro_text.data.ir.AdhocRun(*, id)
Bases: Base
Submit type: datamaestro_text.data.ir.AdhocRun
IR adhoc run
- id: str
The unique dataset ID
- XPM Config datamaestro_text.data.ir.csv.AdhocRunWithText(*, id, separator, path)
Bases: AdhocRun
Submit type: datamaestro_text.data.ir.csv.AdhocRunWithText
A run stored as separated fields: (qid, doc.id, query, passage)
- id: str
The unique dataset ID
- separator: str
- path: Path
- XPM Config datamaestro_text.data.ir.trec.TrecAdhocRun(*, id, path)
Bases: AdhocRun
Submit type: datamaestro_text.data.ir.trec.TrecAdhocRun
- id: str
The unique dataset ID
- path: Path
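A TREC run file conventionally has six whitespace-separated fields per line: query ID, the literal `Q0`, document ID, rank, score and run tag. The following sketch loads such lines into per-query rankings. `read_run` is a hypothetical helper, not part of TrecAdhocRun, and sorting by decreasing score is a choice made for this example.

```python
from typing import Dict, Iterable, List, Tuple


def read_run(lines: Iterable[str]) -> Dict[str, List[Tuple[str, float]]]:
    """Map query ID -> [(document ID, score), ...] sorted by decreasing score."""
    run: Dict[str, List[Tuple[str, float]]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, doc_id, _rank, score, _tag = parts
        run.setdefault(qid, []).append((doc_id, float(score)))
    for ranking in run.values():
        ranking.sort(key=lambda pair: pair[1], reverse=True)
    return run


sample = [
    "q1 Q0 d7 2 11.0 bm25",
    "q1 Q0 d3 1 12.5 bm25",
    "q2 Q0 d1 1 9.2 bm25",
]
run = read_run(sample)
print(run["q1"])
```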
Results
- XPM Config datamaestro_text.data.ir.trec.TrecAdhocResults(*, id, metrics, results, detailed)
Bases: AdhocResults
Submit type: datamaestro_text.data.ir.trec.TrecAdhocResults
Adhoc results (TREC format)
- id: str
The unique dataset ID
- metrics: List[datamaestro_text.data.ir.Measure]
List of reported metrics
- results: Path
Main results
- detailed: Path
Results per topic (if any)
- get_results() Dict[str, float]
Returns the results as a dictionary {metric_name: value}
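To make the `{metric_name: value}` shape concrete, here is a sketch that builds such a dictionary from `trec_eval`-style output, where each line holds a measure name, a topic (`all` for the aggregate) and a value. That the results file follows this layout is an assumption, and `parse_trec_eval` is a hypothetical helper rather than the body of `get_results`.

```python
from typing import Dict, Iterable


def parse_trec_eval(lines: Iterable[str]) -> Dict[str, float]:
    """Collect aggregate ("all") measures as {metric_name: value}."""
    results: Dict[str, float] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        metric, topic, value = parts
        if topic != "all":
            continue  # per-topic rows are ignored here
        try:
            results[metric] = float(value)
        except ValueError:
            continue  # non-numeric rows such as "runid"
    return results


sample = [
    "runid\tall\tbm25",
    "map\tall\t0.2500",
    "P_10\tall\t0.4000",
    "map\t301\t0.1000",
]
results = parse_trec_eval(sample)
print(results)
```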
Reranking
- XPM Config datamaestro_text.data.ir.RerankAdhoc(*, id, documents, topics, assessments, run)
Bases: Adhoc
Submit type: datamaestro_text.data.ir.RerankAdhoc
Re-ranking ad-hoc task based on an existing run
- id: str
The unique dataset ID
- documents: datamaestro_text.data.ir.Documents
The set of documents
- topics: datamaestro_text.data.ir.Topics
The set of topics
- assessments: datamaestro_text.data.ir.AdhocAssessments
The set of assessments (for each topic)
- run: datamaestro_text.data.ir.AdhocRun
The run to re-rank
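A re-ranking task restricts scoring to the candidates already present in a run. The sketch below shows that loop on plain dictionaries. `rerank` and `overlap_scorer` are hypothetical names, and the word-overlap scorer is only a placeholder for a real model.

```python
from typing import Callable, Dict, List, Tuple

Run = Dict[str, List[Tuple[str, float]]]


def overlap_scorer(query: str, document: str) -> float:
    """Placeholder scorer: number of distinct words shared by query and document."""
    return float(len(set(query.lower().split()) & set(document.lower().split())))


def rerank(
    run: Run,
    topics: Dict[str, str],
    documents: Dict[str, str],
    scorer: Callable[[str, str], float],
) -> Run:
    """Re-score each query's candidates and sort them by decreasing new score."""
    reranked: Run = {}
    for qid, candidates in run.items():
        rescored = [
            (doc_id, scorer(topics[qid], documents[doc_id]))
            for doc_id, _first_stage_score in candidates
        ]
        reranked[qid] = sorted(rescored, key=lambda pair: pair[1], reverse=True)
    return reranked


topics = {"q1": "neural ranking"}
documents = {
    "d1": "classical boolean retrieval",
    "d2": "neural ranking models",
    "d3": "ranking with bm25",
}
first_stage = {"q1": [("d1", 3.0), ("d3", 2.0), ("d2", 1.0)]}
reranked = rerank(first_stage, topics, documents, overlap_scorer)
print(reranked["q1"])
```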
Document Index
- XPM Config datamaestro_text.data.ir.DocumentStore(*, id, count)
Bases: Documents
Submit type: datamaestro_text.data.ir.DocumentStore
A document store
A document store can match external and internal IDs, return the document content, and return the number of documents.
- id: str
The unique dataset ID
- count: int
Number of documents
- docid_internal2external(docid: int)
Converts an internal collection ID (integer) to an external ID
- property documentcount
Returns the number of documents
- iter_sample(randint: Callable[[int], int] | None) Iterator[DocumentRecord]
Sample documents from the dataset
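The `randint` parameter suggests that the caller supplies the source of randomness. Here is a minimal sketch of sampling with replacement under that reading, assuming `randint(n)` returns an integer in `[0, n)` and that the iterator never ends. Both points are assumptions, and this free function is not the library's method.

```python
import random
from itertools import islice
from typing import Callable, Iterator, Optional, Sequence


def iter_sample(
    documents: Sequence[str],
    randint: Optional[Callable[[int], int]] = None,
) -> Iterator[str]:
    """Endlessly yield documents drawn uniformly with replacement."""
    if randint is None:
        # Fall back to the global random generator
        randint = random.randrange
    while True:
        yield documents[randint(len(documents))]


docs = ["d0", "d1", "d2", "d3"]
rng = random.Random(0)  # seeded generator for reproducibility
sample = list(islice(iter_sample(docs, rng.randrange), 5))
print(sample)
```

Passing the generator in keeps sampling reproducible without touching global random state.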
- XPM Config datamaestro_text.data.ir.AdhocIndex(*, id, count)
Bases: DocumentStore
Submit type: datamaestro_text.data.ir.AdhocIndex
An index can be used to retrieve documents based on terms
- id: str
The unique dataset ID
- count: int
Number of documents
- term_df(term: str)
Returns the document frequency
- property termcount
Returns the number of terms in the index
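Document frequency is the number of documents that contain a term at least once. The toy sketch below computes it over an in-memory corpus. `build_df` and `idf` are hypothetical helpers, whitespace tokenization is an assumption, and counting distinct terms for the term count is one possible reading of `termcount`.

```python
import math
from collections import Counter
from typing import Dict


def build_df(documents: Dict[str, str]) -> Counter:
    """Count, for each term, the number of documents containing it."""
    df: Counter = Counter()
    for text in documents.values():
        # A set ensures each document counts once per term
        df.update(set(text.lower().split()))
    return df


def idf(term: str, df: Counter, n_documents: int) -> float:
    """Plain inverse document frequency, log(N / df); 0.0 for unseen terms."""
    return math.log(n_documents / df[term]) if df[term] else 0.0


documents = {"d1": "the cat sat", "d2": "the dog sat down", "d3": "a cat"}
df = build_df(documents)
print(df["cat"], len(df), idf("dog", df, len(documents)))
```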
Training triplets
- XPM Config datamaestro_text.data.ir.TrainingTriplets(*, id)
Bases: Base, ABC
Submit type: datamaestro_text.data.ir.TrainingTriplets
Triplet for training IR systems: query / query ID, positive document, negative document
- id: str
The unique dataset ID
- iter() Iterator[Tuple[TopicRecord, DocumentRecord, DocumentRecord]]
Returns an iterator over (topic, positive document, negative document) triplets
- XPM Config datamaestro_text.data.ir.PairwiseSampleDataset(*, id)
Bases: Base, ABC
Submit type: datamaestro_text.data.ir.PairwiseSampleDataset
Datasets where each record is a query with positive and negative samples
- id: str
The unique dataset ID
- XPM Config datamaestro_text.data.ir.TrainingTripletsLines(*, id, sep, path, doc_ids, topic_ids)
Bases: TrainingTriplets
Submit type: datamaestro_text.data.ir.TrainingTripletsLines
Training triplets with one line per triple (query texts)
- id: str
The unique dataset ID
- sep: str
- path: Path
- doc_ids: bool
True if we have documents IDs
- topic_ids: bool
True if we have query IDs
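The `doc_ids` and `topic_ids` flags say whether each field of a line is an ID or a text. A minimal sketch of reading one such line follows. `parse_triplet_line` is a hypothetical helper, and the (query, positive, negative) field order and the dictionary output are assumptions rather than the library's record types.

```python
from typing import Dict, Tuple

Field = Dict[str, str]


def parse_triplet_line(
    line: str, sep: str = "\t", doc_ids: bool = False, topic_ids: bool = False
) -> Tuple[Field, Field, Field]:
    """Split one line into (topic, positive document, negative document)."""
    query, positive, negative = line.rstrip("\n").split(sep)
    # Each field is tagged as an ID or as a text depending on the flags
    topic_key = "id" if topic_ids else "text"
    doc_key = "id" if doc_ids else "text"
    return {topic_key: query}, {doc_key: positive}, {doc_key: negative}


triplet = parse_triplet_line("q12\tD4\tD9\n", doc_ids=True, topic_ids=True)
print(triplet)
```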
- XPM Config datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset(*, id, repo_id, data_files, split, ids, query_id, pos_id, neg_id)
Bases: HuggingFaceDataset, PairwiseSampleDataset
Submit type: datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset
Triplet for training IR systems: query / query ID, positive document, negative document
- id: str
The unique dataset ID
- repo_id: str
- data_files: str
- split: str
- ids: bool
True if the triplet is made of IDs, False otherwise
- query_id: str = qid
The name of the field containing the query ID
- pos_id: str = pos
The name of the field containing the positive samples
- neg_id: str = neg
The name of the field containing the negative samples
Transforms
- XPM Config datamaestro_text.transforms.ir.StoreTrainingTripletTopicAdapter(*, id, store, data)
Bases: TrainingTriplets
Submit type: datamaestro_text.transforms.ir.StoreTrainingTripletTopicAdapter
Retrieve an adhoc topic text from a topic store (given the topic ID)
- id: str
- store: datamaestro_text.data.ir.TopicsStore
The topic store to use
- data: datamaestro_text.data.ir.TrainingTriplets
Input data
- XPM Config datamaestro_text.transforms.ir.StoreTrainingTripletDocumentAdapter(*, id, store, data)
Bases: TrainingTriplets
Submit type: datamaestro_text.transforms.ir.StoreTrainingTripletDocumentAdapter
Transforms training triplets to add the document text from a document store
- id: str
- store: datamaestro_text.data.ir.DocumentStore
The document store to use
- data: datamaestro_text.data.ir.TrainingTriplets
Input data
- XPM Task datamaestro_text.transforms.ir.ShuffledTrainingTripletsLines(*, data, doc_ids, topic_ids, seed, compressed, sample_rate, sample_max)
Bases: Task
Submit type: Any
Shuffle a set of training triplets
- data: datamaestro_text.data.ir.TrainingTriplets
Input data
- path: Path (generated)
Output path
- doc_ids: bool
Whether to use document ids
- topic_ids: bool
True if we have query IDs
- seed: int
The random seed
- compressed: bool = True
Compress the output
- sample_rate: float = 1.0
Sampling rate - set to 1 to keep all the samples
- sample_max: int = 0
Maximum number of samples
- tmp_path: Path (generated)
Path where temporary files will be stored
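The `seed`, `sample_rate` and `sample_max` parameters can be read as a seeded "sample, shuffle, truncate" pipeline. The in-memory sketch below follows that reading. `shuffle_triplets` is a hypothetical helper, treating `sample_max = 0` as "no limit" is an assumption, and the actual task writes to files rather than returning a list.

```python
import random
from typing import Iterable, List


def shuffle_triplets(
    lines: Iterable[str], seed: int, sample_rate: float = 1.0, sample_max: int = 0
) -> List[str]:
    """Sample lines at the given rate, shuffle them, then cap their number."""
    rng = random.Random(seed)  # local generator: same seed, same output
    kept = [
        line for line in lines
        if sample_rate >= 1.0 or rng.random() < sample_rate
    ]
    rng.shuffle(kept)
    if sample_max > 0:  # assumption: 0 means "keep everything"
        kept = kept[:sample_max]
    return kept


lines = [f"q{i}\tpos{i}\tneg{i}" for i in range(10)]
shuffled = shuffle_triplets(lines, seed=42)
capped = shuffle_triplets(lines, seed=42, sample_max=3)
print(len(shuffled), len(capped))
```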