Information Retrieval API
Collection
- XPM Config datamaestro_text.data.ir.Adhoc(*, id, documents, topics, assessments)
An Adhoc IR collection with documents, topics and their assessments
- id: str
The unique dataset ID
- documents: datamaestro_text.data.ir.AdhocDocuments
The set of documents
- topics: datamaestro_text.data.ir.AdhocTopics
The set of topics
- assessments: datamaestro_text.data.ir.AdhocAssessments
The set of assessments (for each topic)
Topics
- XPM Config datamaestro_text.data.ir.AdhocTopics(*, id)
- id: str
The unique dataset ID
- iter() → Iterator[AdhocTopic]
Returns an iterator over topics
- XPM Config datamaestro_text.data.ir.csv.AdhocTopics(*, id, separator, path)
Topics stored as (query ID, query text) pairs, with the two fields split by a separator
- id: str
The unique dataset ID
- separator: str
- path: Path
- class datamaestro_text.data.ir.AdhocTopic(qid: str, text: str, metadata: Dict[str, str])
The most generic topic: an ID with some text
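As an illustration of the separator-based topic format, here is a minimal sketch that reads "query ID, separator, query text" lines into topic records. It does not use the library: `Topic` is a hypothetical stand-in that only mirrors the documented `AdhocTopic(qid, text, metadata)` fields, and `iter_topics` and the tab default are assumptions made for the example.

```python
import io
from dataclasses import dataclass, field
from typing import Dict, Iterator, TextIO


@dataclass
class Topic:
    """Hypothetical stand-in mirroring AdhocTopic(qid, text, metadata)."""

    qid: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def iter_topics(stream: TextIO, separator: str = "\t") -> Iterator[Topic]:
    """Yield one Topic per non-empty 'qid<separator>query' line."""
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split once only, so the query text may itself contain the separator
        qid, text = line.split(separator, 1)
        yield Topic(qid=qid, text=text)


# In-memory sample standing in for the file referenced by `path`
sample = io.StringIO("q1\tneural ranking models\nq2\tdense retrieval\n")
topics = list(iter_topics(sample))
```

Splitting on the first separator only is a design choice for the sketch; whether the library does the same is not stated in this reference.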
Documents
- XPM Config datamaestro_text.data.ir.AdhocDocuments(*, id, count)
A set of documents with identifiers
- id: str
The unique dataset ID
- count: int
Number of documents
Assessments
- XPM Config datamaestro_text.data.ir.AdhocAssessments(*, id)
Ad-hoc assessments (qrels)
- id: str
The unique dataset ID
- iter() → Iterator[AdhocAssessedTopic]
Returns an iterator over assessments
- class datamaestro_text.data.ir.AdhocAssessedTopic(qid: str, assessments: List[datamaestro_text.data.ir.AdhocAssessment])
- class datamaestro_text.data.ir.AdhocAssessment(docno: str, rel: float)
Adhoc assessments associate a document ID with a relevance
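The nesting of assessments inside assessed topics can be sketched as follows. `Assessment` and `AssessedTopic` are hypothetical stand-ins that copy only the documented constructor fields, and `group_by_topic` is an illustrative helper, not a library function.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass
class Assessment:
    """Hypothetical stand-in mirroring AdhocAssessment(docno, rel)."""

    docno: str
    rel: float


@dataclass
class AssessedTopic:
    """Hypothetical stand-in mirroring AdhocAssessedTopic(qid, assessments)."""

    qid: str
    assessments: List[Assessment]


def group_by_topic(rows: Iterable[Tuple[str, str, float]]) -> Iterator[AssessedTopic]:
    """Group flat (qid, docno, rel) judgments into one record per topic."""
    grouped: Dict[str, List[Assessment]] = {}
    for qid, docno, rel in rows:
        grouped.setdefault(qid, []).append(Assessment(docno, float(rel)))
    # dict preserves insertion order, so topics come out in first-seen order
    for qid, assessments in grouped.items():
        yield AssessedTopic(qid, assessments)


rows = [("q1", "d3", 1), ("q1", "d7", 0), ("q2", "d3", 2)]
assessed = list(group_by_topic(rows))

# Documents judged relevant (rel > 0) for each topic
relevant = {t.qid: [a.docno for a in t.assessments if a.rel > 0] for t in assessed}
```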
Runs
Results
- XPM Config datamaestro_text.data.ir.trec.TrecAdhocResults(*, id, metrics, results, detailed)
Adhoc results (TREC format)
- id: str
The unique dataset ID
- metrics: List[datamaestro_text.data.ir.Measure]
List of reported metrics
- results: Path
Main results
- detailed: Path
Results per topic (if any)
- get_results() → Dict[str, float]
Returns the results as a dictionary {metric_name: value}
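To make the {metric_name: value} shape concrete, the sketch below parses a summary into such a dictionary. It assumes the results file uses three whitespace-separated columns (metric, scope, value) with `all` marking aggregate rows, as in `trec_eval` summary output; this reference does not confirm the layout, and `parse_summary` is a hypothetical helper.

```python
from typing import Dict, Iterable


def parse_summary(lines: Iterable[str]) -> Dict[str, float]:
    """Collect aggregate metrics from 'metric scope value' lines (assumed layout)."""
    results: Dict[str, float] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        metric, scope, value = fields
        if scope != "all":
            continue  # per-topic rows are not part of the summary
        try:
            results[metric] = float(value)
        except ValueError:
            continue  # non-numeric entries such as the run identifier
    return results


summary = ["runid all bm25", "map all 0.2541", "P_10 all 0.4100"]
results = parse_summary(summary)
```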
Reranking
- XPM Config datamaestro_text.data.ir.RerankAdhoc(*, id, documents, topics, assessments, run)
Re-ranking ad-hoc task based on an existing run
- id: str
The unique dataset ID
- documents: datamaestro_text.data.ir.AdhocDocuments
The set of documents
- topics: datamaestro_text.data.ir.AdhocTopics
The set of topics
- assessments: datamaestro_text.data.ir.AdhocAssessments
The set of assessments (for each topic)
- run: datamaestro_text.data.ir.AdhocRun
The run to re-rank
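The idea behind a re-ranking task, rescoring only the documents an existing run already retrieved, can be sketched as below. Everything here is an assumption for illustration: the `Run` alias (query ID mapped to scored document IDs), the plain dictionaries standing in for topics and documents, and the toy `overlap` scorer. The structure of `AdhocRun` is not described in this reference.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical run representation: qid -> [(docno, score), ...]
Run = Dict[str, List[Tuple[str, float]]]


def rerank(
    run: Run,
    topics: Dict[str, str],
    documents: Dict[str, str],
    score: Callable[[str, str], float],
) -> Run:
    """Rescore the documents of an existing run and sort by the new score."""
    reranked: Run = {}
    for qid, ranking in run.items():
        rescored = [(docno, score(topics[qid], documents[docno])) for docno, _ in ranking]
        reranked[qid] = sorted(rescored, key=lambda pair: pair[1], reverse=True)
    return reranked


def overlap(query: str, text: str) -> float:
    """Toy scorer: number of distinct terms shared by query and document."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


run = {"q1": [("d1", 2.0), ("d2", 1.0)]}
topics = {"q1": "dense retrieval"}
documents = {"d1": "sparse lexical matching", "d2": "dense retrieval with transformers"}
reranked = rerank(run, topics, documents, overlap)
```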
Document Index
- XPM Config datamaestro_text.data.ir.AdhocDocumentStore(*, id, count)
A document store
A document store can match external and internal IDs, return the content of a document, and return the number of documents
- id: str
The unique dataset ID
- count: int
Number of documents
- docid_internal2external(docid: int)
Converts an internal collection ID (integer) to an external ID
- document(internal_docid: int) → AdhocDocument
Returns a document given its internal ID
- document_text(docid: str) → str
Returns the text of the document given its id
- property documentcount
Returns the number of documents in the store
- iter_sample(randint: Optional[Callable[[int], int]]) → Iterator[AdhocDocument]
Sample documents from the dataset
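The three roles of a document store (ID mapping, content access, counting) can be illustrated with a small in-memory sketch. `InMemoryStore` and `Document` are hypothetical and do not subclass the library's classes; the method names follow the listing above. Two further assumptions: `randint(n)` returns an integer in `[0, n)`, and sampling is endless and with replacement.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class Document:
    """Hypothetical stand-in for a stored document."""

    docid: str  # external ID
    text: str


class InMemoryStore:
    """Illustrative store; internal IDs are list positions."""

    def __init__(self, docs: List[Document]):
        self._docs = docs
        self._external2internal = {d.docid: i for i, d in enumerate(docs)}

    @property
    def documentcount(self) -> int:
        return len(self._docs)

    def docid_internal2external(self, docid: int) -> str:
        return self._docs[docid].docid

    def document(self, internal_docid: int) -> Document:
        return self._docs[internal_docid]

    def document_text(self, docid: str) -> str:
        return self._docs[self._external2internal[docid]].text

    def iter_sample(
        self, randint: Optional[Callable[[int], int]] = None
    ) -> Iterator[Document]:
        # Assumption: randint(n) draws an index in [0, n); the stream never ends
        draw = randint or (lambda n: random.randrange(n))
        while True:
            yield self._docs[draw(self.documentcount)]


store = InMemoryStore([Document("D-a", "first"), Document("D-b", "second")])
# A deterministic "random" source makes the sample reproducible
last_doc = next(store.iter_sample(randint=lambda n: n - 1))
```

Passing the random source as a callable lets callers plug in a seeded generator, which is presumably why the documented signature accepts one.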
Training triplets
- XPM Config datamaestro_text.data.ir.TrainingTriplets(*, id, ids)
Triplet for training IR systems: query / query ID, positive document, negative document
- id: str
The unique dataset ID
- ids: bool
- XPM Config datamaestro_text.data.ir.PairwiseSampleDataset(*, id, ids)
Datasets where each record is a query with positive and negative samples
- id: str
The unique dataset ID
- ids: bool
Whether data are texts or IDs
- XPM Config datamaestro_text.data.ir.TrainingTripletsLines(*, id, ids, sep, path)
Training triplets with one line per triple (text only)
- id: str
The unique dataset ID
- ids: bool
- sep: str
- path: Path
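A one-triple-per-line file can be read along these lines. This is a sketch under assumptions: the field order (query, positive, negative) follows the triplet description above, the tab default for `sep` is a guess, and `Triplet` and `iter_triplets` are hypothetical names.

```python
from typing import Iterable, Iterator, NamedTuple


class Triplet(NamedTuple):
    """Hypothetical record: query, positive document, negative document."""

    query: str
    positive: str
    negative: str


def iter_triplets(lines: Iterable[str], sep: str = "\t") -> Iterator[Triplet]:
    """Yield one Triplet per non-empty line, failing loudly on bad rows."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split(sep)
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
        yield Triplet(*fields)


lines = [
    "what is bm25\tBM25 is a ranking function\tParis is in France\n",
    "\n",
    "dense retrieval\tDense retrievers embed text\tBananas are yellow\n",
]
triplets = list(iter_triplets(lines))
```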
- XPM Config datamaestro_text.data.ir.csv.TrainingTriplets(*, id, path, separator)
Training triplets (full text)
- id: str
The unique dataset ID
- ids: bool = True (constant)
- path: Path
- separator: str
- XPM Config datamaestro_text.data.ir.csv.TrainingTripletsID(*, id, sep, path, separator, documents, topics)
Training triplets (query/document IDs only)
- id: str
The unique dataset ID
- ids: bool = True (constant)
Whether documents are IDs or full text
- sep: str
- path: Path
- separator: str
Field separator
- documents: datamaestro_text.data.ir.AdhocDocuments
The documents
- topics: datamaestro_text.data.ir.AdhocTopics
The topics
- XPM Config datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset(*, id, ids, repo_id, data_files, split, query_id, pos_id, neg_id)
Triplet for training IR systems: query / query ID, positive document, negative document
- id: str
The unique dataset ID
- ids: bool
True if the triplet is made of IDs, False otherwise
- repo_id: str
- data_files: str
- split: str
- query_id: str = qid
The name of the field containing the query ID
- pos_id: str = pos
The name of the field containing the positive samples
- neg_id: str = neg
The name of the field containing the negative samples
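The role of the three field-name parameters can be shown with plain dictionaries standing in for dataset records. This sketch does not load anything from the Hugging Face Hub; `iter_samples` is a hypothetical helper, and it assumes that the positive and negative fields each hold a list of samples.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def iter_samples(
    records: Iterable[Dict[str, Any]],
    query_id: str = "qid",
    pos_id: str = "pos",
    neg_id: str = "neg",
) -> Iterator[Tuple[Any, List[Any], List[Any]]]:
    """Yield (query, positives, negatives) using configurable field names."""
    for record in records:
        yield record[query_id], list(record[pos_id]), list(record[neg_id])


# Records using the default field names
records = [{"qid": "q1", "pos": ["d3"], "neg": ["d7", "d9"]}]
samples = list(iter_samples(records))

# Records from a source that names its fields differently
renamed = [{"query": "q2", "positives": ["d1"], "negatives": ["d4"]}]
renamed_samples = list(
    iter_samples(renamed, query_id="query", pos_id="positives", neg_id="negatives")
)
```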