Information Retrieval API
Collection
- XPM Config datamaestro_text.data.ir.Adhoc(*, id, documents, topics, assessments)
Bases: Base
An Adhoc IR collection with documents, topics and their assessments
- id: str
The unique dataset ID
- documents: datamaestro_text.data.ir.AdhocDocuments
The set of documents
- topics: datamaestro_text.data.ir.AdhocTopics
The set of topics
- assessments: datamaestro_text.data.ir.AdhocAssessments
The set of assessments (for each topic)
Topics
- XPM Config datamaestro_text.data.ir.AdhocTopics(*, id)
Bases: Base
A set of topics with associated IDs
- id: str
The unique dataset ID
- count() → int | None
Returns the number of topics if known
- iter() → Iterator[AdhocTopic]
Returns an iterator over topics
- XPM Config datamaestro_text.data.ir.csv.AdhocTopics(*, id, separator, path)
Bases: AdhocTopics
Pairs of query ID and query text, delimited by a separator
- id: str
The unique dataset ID
- separator: str
- path: Path
- class datamaestro_text.data.ir.AdhocTopic(qid: str, text: str, metadata: Dict[str, str])
Bases: object
The most generic topic: an ID with some text
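The sketch below illustrates how separator-delimited topics could be turned into topic objects. It does not use the library itself: `Topic` and `read_topics` are hypothetical stand-ins that mirror the documented `AdhocTopic` fields, and the tab separator and the one-topic-per-line layout are assumptions.

```python
import io
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator


@dataclass
class Topic:
    """Hypothetical stand-in mirroring the documented AdhocTopic fields."""

    qid: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def read_topics(lines: Iterable[str], separator: str = "\t") -> Iterator[Topic]:
    """Yield one Topic per non-empty line of the form <qid><separator><text>."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split only once so the query text may itself contain the separator
        qid, text = line.split(separator, 1)
        yield Topic(qid=qid, text=text)


# Assumed sample content: two topics, tab-separated
sample = io.StringIO("q1\twhat is information retrieval\nq2\tbm25 ranking function\n")
topics = list(read_topics(sample))
```

Each resulting object carries the query ID and text, with an empty metadata dictionary by default.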
Documents
- class datamaestro_text.data.ir.AdhocDocument(docid: str, text: str, internal_docid: int | None = None)
Bases: object
A document with an identifier
- XPM Config datamaestro_text.data.ir.AdhocDocuments(*, id, count)
Bases: Base
A set of documents with identifiers
- id: str
The unique dataset ID
- count: int
Number of documents
- property documentcount
Returns the number of documents
- iter_ids() → Iterator[str]
Iterates over document ids
The default implementation relies on iter_documents, which is inefficient.
Assessments
- XPM Config datamaestro_text.data.ir.AdhocAssessments(*, id)
Bases: Base
Ad-hoc assessments (qrels)
- id: str
The unique dataset ID
- iter() → Iterator[AdhocAssessedTopic]
Returns an iterator over assessments
- XPM Config datamaestro_text.data.ir.trec.TrecAdhocAssessments
Bases: AdhocAssessments
- id: str
The unique dataset ID
- class datamaestro_text.data.ir.AdhocAssessedTopic(qid: str, assessments: List[datamaestro_text.data.ir.AdhocAssessment])
Bases: object
- class datamaestro_text.data.ir.AdhocAssessment(docno: str, rel: float)
Bases: object
Adhoc assessments associate a document ID with a relevance
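As an illustration of how assessments group under topics, the sketch below parses lines in the standard TREC qrels layout (`qid iteration docno rel`) into per-topic lists. `Assessment`, `AssessedTopic` and `parse_qrels` are hypothetical stand-ins shaped after the classes documented above; they are not the library's own parser.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Assessment:
    """Hypothetical stand-in for AdhocAssessment."""

    docno: str
    rel: float


@dataclass
class AssessedTopic:
    """Hypothetical stand-in for AdhocAssessedTopic."""

    qid: str
    assessments: List[Assessment]


def parse_qrels(lines: Iterable[str]) -> List[AssessedTopic]:
    """Group TREC qrels lines (qid, iteration, docno, rel) by topic."""
    by_topic: Dict[str, List[Assessment]] = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # ignore blank or malformed lines
        qid, _iteration, docno, rel = parts
        by_topic[qid].append(Assessment(docno, float(rel)))
    # Dictionaries keep insertion order, so topics come out in file order
    return [AssessedTopic(qid, items) for qid, items in by_topic.items()]


# Assumed sample judgments for two topics
qrels = parse_qrels(["301 0 FBIS3-1 1", "301 0 FBIS3-2 0", "302 0 FT911-5 2"])
```

Relevance values are stored as floats to match the documented `rel: float` field, even though qrels files usually contain integers.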
Runs
- XPM Config datamaestro_text.data.ir.AdhocRun(*, id)
Bases: Base
IR adhoc run
- id: str
The unique dataset ID
Results
- XPM Config datamaestro_text.data.ir.trec.TrecAdhocResults(*, id, metrics, results, detailed)
Bases: AdhocResults
Adhoc results (TREC format)
- id: str
The unique dataset ID
- metrics: List[datamaestro_text.data.ir.Measure]
List of reported metrics
- results: Path
Main results
- detailed: Path
Results per topic (if any)
- get_results() → Dict[str, float]
Returns the results as a dictionary {metric_name: value}
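A dictionary of the form {metric_name: value} could be built as in the sketch below, assuming the results file follows the usual trec_eval output layout (measure name, topic ID or `all`, value). `parse_trec_eval` is a hypothetical helper written for illustration; the actual file layout read by `get_results` is not stated here.

```python
from typing import Dict, Iterable


def parse_trec_eval(lines: Iterable[str]) -> Dict[str, float]:
    """Collect summary ("all") measures from trec_eval-style lines."""
    results: Dict[str, float] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or fields[1] != "all":
            continue  # keep only the summary rows, skip per-topic rows
        name, _, value = fields
        try:
            results[name] = float(value)
        except ValueError:
            continue  # non-numeric entries such as the run identifier
    return results


# Assumed sample output: one non-numeric row, two summary rows, one per-topic row
summary = parse_trec_eval(
    [
        "runid\tall\tbm25",
        "map\tall\t0.2531",
        "P_10\tall\t0.4100",
        "map\t301\t0.5000",
    ]
)
```

Per-topic rows are skipped here; they would correspond to the separate `detailed` file rather than the main results.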
Reranking
- XPM Config datamaestro_text.data.ir.RerankAdhoc(*, id, documents, topics, assessments, run)
Bases: Adhoc
Re-ranking ad-hoc task based on an existing run
- id: str
The unique dataset ID
- documents: datamaestro_text.data.ir.AdhocDocuments
The set of documents
- topics: datamaestro_text.data.ir.AdhocTopics
The set of topics
- assessments: datamaestro_text.data.ir.AdhocAssessments
The set of assessments (for each topic)
- run: datamaestro_text.data.ir.AdhocRun
The run to re-rank
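To make the re-ranking task concrete, the sketch below reorders the documents of an existing run with a new scoring function. The run representation (a mapping from query ID to a list of document IDs), the `rerank` helper and the term-overlap scorer are all assumptions made for illustration; the format of `AdhocRun` is not specified here.

```python
from typing import Callable, Dict, List


def overlap(query: str, text: str) -> float:
    """Toy scorer: number of distinct query terms that occur in the text."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def rerank(
    run: Dict[str, List[str]],
    topics: Dict[str, str],
    doc_text: Dict[str, str],
    score: Callable[[str, str], float] = overlap,
) -> Dict[str, List[str]]:
    """Reorder each topic's candidate documents by decreasing score."""
    reranked: Dict[str, List[str]] = {}
    for qid, docids in run.items():
        query = topics[qid]
        # sorted() is stable, so ties keep their order from the original run
        reranked[qid] = sorted(
            docids, key=lambda docid: score(query, doc_text[docid]), reverse=True
        )
    return reranked


# Assumed toy data: one topic with three candidates from a first-stage run
run = {"q1": ["d1", "d2", "d3"]}
topics = {"q1": "neural ranking"}
doc_text = {
    "d1": "cooking recipes",
    "d2": "neural ranking models",
    "d3": "ranking functions",
}
new_run = rerank(run, topics, doc_text)
```

Only documents already present in the run are considered, which is what distinguishes re-ranking from retrieval over the whole collection.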
Document Index
- XPM Config datamaestro_text.data.ir.AdhocDocumentStore(*, id, count)
Bases: AdhocDocuments
A document store
A document store can match external and internal IDs, return the document content, and return the number of documents.
- id: str
The unique dataset ID
- count: int
Number of documents
- docid_internal2external(docid: int)
Converts an internal collection ID (integer) to an external ID
- document(internal_docid: int) → AdhocDocument
Returns a document given its internal ID
- document_text(docid: str) → str
Returns the text of the document given its id
- property documentcount
Returns the number of documents
- iter_sample(randint: Callable[[int], int] | None) → Iterator[AdhocDocument]
Sample documents from the dataset
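The following in-memory sketch shows one way the document store operations could fit together. `Document` and `InMemoryStore` are hypothetical stand-ins, not library classes. Two behaviors are assumptions: internal IDs are taken to be list positions, and `randint(n)` is taken to return an integer in the range [0, n).

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class Document:
    """Hypothetical stand-in mirroring the documented AdhocDocument fields."""

    docid: str
    text: str
    internal_docid: Optional[int] = None


class InMemoryStore:
    """Toy store where the internal ID is the position in a list (assumption)."""

    def __init__(self, documents: List[Document]):
        self._docs = [Document(d.docid, d.text, ix) for ix, d in enumerate(documents)]
        self._by_external = {d.docid: d for d in self._docs}

    @property
    def documentcount(self) -> int:
        return len(self._docs)

    def docid_internal2external(self, docid: int) -> str:
        return self._docs[docid].docid

    def document(self, internal_docid: int) -> Document:
        return self._docs[internal_docid]

    def document_text(self, docid: str) -> str:
        return self._by_external[docid].text

    def iter_sample(
        self, randint: Optional[Callable[[int], int]] = None
    ) -> Iterator[Document]:
        """Yield documents endlessly; randint(n) is assumed to return 0 <= i < n."""
        pick = randint or random.randrange
        while True:
            yield self._docs[pick(self.documentcount)]


store = InMemoryStore([Document("d1", "first text"), Document("d2", "second text")])
first = next(store.iter_sample(lambda n: 0))  # deterministic sampler for the demo
```

Passing the random source as an argument keeps sampling reproducible, since callers can supply a seeded generator.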
- XPM Config datamaestro_text.data.ir.AdhocIndex(*, id, count)
Bases: AdhocDocumentStore
An index can be used to retrieve documents based on terms
- id: str
The unique dataset ID
- count: int
Number of documents
- term_df(term: str)
Returns the document frequency of the term (the number of documents that contain it)
- property termcount
Returns the number of terms in the index
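As a rough illustration of the two term statistics, the sketch below builds a tiny index over whitespace-tokenized, lowercased texts. `TinyIndex` is a hypothetical class, and reading `termcount` as the number of distinct terms is an assumption (it could also denote the total number of term occurrences).

```python
from typing import Dict, List


class TinyIndex:
    """Toy index that only tracks document frequencies."""

    def __init__(self, texts: List[str]):
        self._df: Dict[str, int] = {}
        for text in texts:
            # set() ensures a term counts once per document
            for term in set(text.lower().split()):
                self._df[term] = self._df.get(term, 0) + 1

    def term_df(self, term: str) -> int:
        """Number of documents containing the term (0 if unseen)."""
        return self._df.get(term, 0)

    @property
    def termcount(self) -> int:
        """Number of distinct terms (assumed interpretation)."""
        return len(self._df)


index = TinyIndex(["the cat sat", "the dog sat down", "a cat"])
```

With these three texts, `index.term_df("cat")` is 2 and `index.termcount` is 6.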
Training triplets
- XPM Config datamaestro_text.data.ir.TrainingTriplets(*, id, ids)
Bases: Base
Triplet for training IR systems: query / query ID, positive document, negative document
- id: str
The unique dataset ID
- ids: bool
- XPM Config datamaestro_text.data.ir.PairwiseSampleDataset(*, id, ids)
Bases: Base
Datasets where each record is a query with positive and negative samples
- id: str
The unique dataset ID
- ids: bool
Whether data are texts or IDs
- XPM Config datamaestro_text.data.ir.TrainingTripletsLines(*, id, ids, sep, path)
Bases: TrainingTriplets
Training triplets with one line per triple (text only)
- id: str
The unique dataset ID
- ids: bool
- sep: str
- path: Path
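A file with one triplet per line could be read as in the sketch below. `Triplet` and `read_triplets` are hypothetical names, and both the tab separator and the column order (query, positive document, negative document) are assumptions based on the description of training triplets above.

```python
from typing import Iterable, Iterator, NamedTuple


class Triplet(NamedTuple):
    """Hypothetical record: a query with one positive and one negative document."""

    query: str
    positive: str
    negative: str


def read_triplets(lines: Iterable[str], sep: str = "\t") -> Iterator[Triplet]:
    """Yield one Triplet per non-empty line of three sep-delimited fields."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Raises ValueError if a line does not hold exactly three fields
        query, positive, negative = line.split(sep)
        yield Triplet(query, positive, negative)


# Assumed sample content: a blank line between two triplets
triplets = list(
    read_triplets(
        [
            "what is bm25\tBM25 is a ranking function\tCats are mammals\n",
            "\n",
            "capital of france\tParis is the capital\tBerlin is in Germany\n",
        ]
    )
)
```

The same reader works for ID-based triplets, since fields are kept as strings either way.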
- XPM Config datamaestro_text.data.ir.csv.TrainingTriplets(*, id, path, separator)
Bases: TrainingTriplets
Training triplets (full text)
- id: str
The unique dataset ID
- ids: bool = True (constant)
- path: Path
- separator: str
- XPM Config datamaestro_text.data.ir.csv.TrainingTripletsID(*, id, sep, path, separator, documents, topics)
Bases: TrainingTripletsLines
Training triplets (query/document IDs only)
- id: str
The unique dataset ID
- ids: bool = True (constant)
Whether documents are IDs or full text
- sep: str
- path: Path
- separator: str
Field separator
- documents: datamaestro_text.data.ir.AdhocDocuments
The documents
- topics: datamaestro_text.data.ir.AdhocTopics
The topics
- XPM Config datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset(*, id, ids, repo_id, data_files, split, query_id, pos_id, neg_id)
Bases: HuggingFaceDataset, PairwiseSampleDataset
Triplet for training IR systems: query / query ID, positive document, negative document
- id: str
The unique dataset ID
- ids: bool
True if the triplet is made of IDs, False otherwise
- repo_id: str
- data_files: str
- split: str
- query_id: str = qid
The name of the field containing the query ID
- pos_id: str = pos
The name of the field containing the positive samples
- neg_id: str = neg
The name of the field containing the negative samples
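The field-name parameters above can be read as a mapping from record keys to the three parts of a pairwise sample. The sketch below applies that mapping to plain dictionaries rather than to a real Hugging Face dataset. `to_pairwise` and the sample records are hypothetical, and treating the positive and negative fields as lists is an assumption.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def to_pairwise(
    records: Iterable[Dict[str, Any]],
    query_id: str = "qid",
    pos_id: str = "pos",
    neg_id: str = "neg",
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (query, positives, negatives) using configurable field names."""
    for record in records:
        yield record[query_id], list(record[pos_id]), list(record[neg_id])


# Assumed records using the default field names
records = [
    {"qid": "q1", "pos": ["d3"], "neg": ["d7", "d9"]},
    {"qid": "q2", "pos": ["d1", "d4"], "neg": ["d2"]},
]
samples = list(to_pairwise(records))

# A dataset with different column names only needs different arguments
renamed = [{"query": "q5", "positives": ["d8"], "negatives": ["d0"]}]
custom = list(to_pairwise(renamed, "query", "positives", "negatives"))
```

Keeping the field names as parameters lets one reader serve datasets whose columns are named differently.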