Information Retrieval API
Data objects
- class datamaestro_text.data.ir.base.BaseHolder
Bases:
object
Base class for topics and documents
- class datamaestro_text.data.ir.base.Document
Bases:
BaseHolder
Base class for documents
- class datamaestro_text.data.ir.base.GenericDocument(id: str, text: str)
Bases:
TextAndIDHolder, Document
Documents with ID and text
- class datamaestro_text.data.ir.base.GenericTopic(id: str, text: str)
Bases:
TextAndIDHolder, Topic
- class datamaestro_text.data.ir.base.IDDocument(id: str)
Documents with ID
- class datamaestro_text.data.ir.base.IDHolder(id: str)
Bases:
object
Base data class for ID only data structures
- class datamaestro_text.data.ir.base.TextAndIDHolder(id: str, text: str)
Bases:
object
Base data class for ID and text data structures
- class datamaestro_text.data.ir.base.TextDocument(text: str)
Bases:
TextHolder, Document
Documents with text
- class datamaestro_text.data.ir.base.TextHolder(text: str)
Bases:
object
Base data class for text only data structures
- class datamaestro_text.data.ir.base.TextTopic(text: str)
Bases:
TextHolder, Topic
- class datamaestro_text.data.ir.base.Topic
Bases:
BaseHolder
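The holder hierarchy above can be illustrated with a minimal sketch. The classes below are local stand-ins that mirror the documented names and signatures; they are not imported from datamaestro_text, and the use of dataclasses is an assumption made for brevity.

```python
from dataclasses import dataclass


class BaseHolder:
    """Stand-in for the common base of topics and documents."""


class Document(BaseHolder):
    """Stand-in marker class for documents."""


class Topic(BaseHolder):
    """Stand-in marker class for topics."""


@dataclass
class TextAndIDHolder:
    """Stand-in data class holding an ID and a text."""

    id: str
    text: str


@dataclass
class GenericDocument(TextAndIDHolder, Document):
    """Document carrying both an ID and its text."""


@dataclass
class GenericTopic(TextAndIDHolder, Topic):
    """Topic carrying both an ID and its text."""


doc = GenericDocument(id="d1", text="neural ranking models")
topic = GenericTopic(id="q1", text="what is neural ranking")

# Both objects expose the same fields; the marker base classes are what
# lets calling code tell a topic from a document.
print(isinstance(doc, Document), isinstance(doc, Topic))
print(isinstance(topic, BaseHolder), topic.text)
```

Splitting the data fields (holders) from the role (Document or Topic) means the same field layout can be reused for both roles without duplicating attributes.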
Collection
- XPM Config datamaestro_text.data.ir.Adhoc(*, id, documents, topics, assessments)
Bases:
Base
An Adhoc IR collection with documents, topics and their assessments
- id: str
The unique dataset ID
- documents: datamaestro_text.data.ir.Documents
The set of documents
- topics: datamaestro_text.data.ir.Topics
The set of topics
- assessments: datamaestro_text.data.ir.AdhocAssessments
The set of assessments (for each topic)
Topics
- XPM Config datamaestro_text.data.ir.Topics(*, id)
Bases:
Base
A set of topics with associated IDs
- id: str
The unique dataset ID
- count() → int | None
Returns the number of topics if known
- XPM Config datamaestro_text.data.ir.csv.Topics(*, id, separator, path)
Bases:
Topics
Pairs of query ID and query text, split by a separator
- id: str
The unique dataset ID
- separator: str
- path: Path
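As a rough illustration of the separator-based topic format, the sketch below reads (query ID, query text) pairs from lines. The helper name iter_topics, the tab default, and the one-pair-per-line layout are assumptions for this example, not the library's actual reader.

```python
from typing import Iterable, Iterator, Tuple


def iter_topics(lines: Iterable[str], separator: str = "\t") -> Iterator[Tuple[str, str]]:
    """Yield (query_id, query_text) pairs from separator-delimited lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split once only, so the separator may still appear inside the query text
        query_id, query_text = line.split(separator, 1)
        yield query_id, query_text


sample = [
    "q1\twhat is information retrieval\n",
    "q2\tbm25 ranking function\n",
]
topics = list(iter_topics(sample))
for query_id, query_text in topics:
    print(query_id, "=>", query_text)
```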
- class datamaestro_text.data.ir.Topic
Bases:
BaseHolder
Documents
- class datamaestro_text.data.ir.Document
Bases:
BaseHolder
Base class for documents
- XPM Config datamaestro_text.data.ir.Documents(*, id, count)
Bases:
Base
A set of documents with identifiers
See IR Datasets for the list of query classes
- id: str
The unique dataset ID
- count: int
Number of documents
- property documentcount
Returns the number of documents in the collection
- iter_ids() → Iterator[str]
Iterates over document IDs
By default, this falls back on iter_documents, which is not efficient.
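The fallback behavior of iter_ids can be sketched as follows. The classes are hypothetical stand-ins written for this example (only the method names iter_ids and iter_documents come from the text above), and the override shown is one possible way to avoid the cost of the default.

```python
from typing import Iterator, List, NamedTuple


class Doc(NamedTuple):
    """Hypothetical document record with an ID and a text."""

    id: str
    text: str


class Documents:
    """Hypothetical stand-in for a document collection."""

    def iter_documents(self) -> Iterator[Doc]:
        raise NotImplementedError

    def iter_ids(self) -> Iterator[str]:
        # Generic fallback: walk over full documents and keep only the ID
        for doc in self.iter_documents():
            yield doc.id


class InMemoryDocuments(Documents):
    def __init__(self, docs: List[Doc]):
        self.docs = docs

    def iter_documents(self) -> Iterator[Doc]:
        return iter(self.docs)


class IndexedDocuments(InMemoryDocuments):
    """Keeps a separate ID list so IDs can be listed without reading documents."""

    def __init__(self, docs: List[Doc]):
        super().__init__(docs)
        self.ids = [doc.id for doc in docs]

    def iter_ids(self) -> Iterator[str]:
        return iter(self.ids)


docs = [Doc("d1", "first text"), Doc("d2", "second text")]
fallback_ids = list(InMemoryDocuments(docs).iter_ids())
direct_ids = list(IndexedDocuments(docs).iter_ids())
print(fallback_ids == direct_ids)
```

Both variants return the same IDs; the difference is only in how much data has to be read to produce them.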
Assessments
- XPM Config datamaestro_text.data.ir.AdhocAssessments(*, id)
Bases:
Base, ABC
Ad-hoc assessments (qrels)
- id: str
The unique dataset ID
- iter() → Iterator[AdhocAssessment]
Returns an iterator over assessments
- XPM Config datamaestro_text.data.ir.trec.TrecAdhocAssessments
Bases:
AdhocAssessments
- id: str
The unique dataset ID
- class datamaestro_text.data.ir.AdhocAssessment(doc_id: str)
Bases:
object
Runs
- XPM Config datamaestro_text.data.ir.AdhocRun(*, id)
Bases:
Base
IR adhoc run
- id: str
The unique dataset ID
Results
- XPM Config datamaestro_text.data.ir.trec.TrecAdhocResults(*, id, metrics, results, detailed)
Bases:
AdhocResults
Adhoc results (TREC format)
- id: str
The unique dataset ID
- metrics: List[datamaestro_text.data.ir.Measure]
List of reported metrics
- results: Path
Main results
- detailed: Path
Results per topic (if any)
- get_results() → Dict[str, float]
Returns the results as a dictionary {metric_name: value}
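To make the {metric_name: value} shape concrete, here is a minimal sketch that builds such a dictionary from summary lines. It assumes a three-column layout (metric name, topic marker, value) like the one trec_eval prints; the helper parse_results is hypothetical and is not the library's implementation of get_results.

```python
from typing import Dict, Iterable


def parse_results(lines: Iterable[str]) -> Dict[str, float]:
    """Build {metric_name: value} from whitespace-separated summary lines."""
    results: Dict[str, float] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # not a "metric topic value" line
        metric, _topic, value = fields
        try:
            results[metric] = float(value)
        except ValueError:
            # Non-numeric entries (such as a run identifier) are skipped
            continue
    return results


sample = [
    "runid\tall\tmy_run",
    "map\tall\t0.2500",
    "P_10\tall\t0.4000",
]
summary = parse_results(sample)
print(summary)
```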
Reranking
- XPM Config datamaestro_text.data.ir.RerankAdhoc(*, id, documents, topics, assessments, run)
Bases:
Adhoc
Re-ranking ad-hoc task based on an existing run
- id: str
The unique dataset ID
- documents: datamaestro_text.data.ir.Documents
The set of documents
- topics: datamaestro_text.data.ir.Topics
The set of topics
- assessments: datamaestro_text.data.ir.AdhocAssessments
The set of assessments (for each topic)
- run: datamaestro_text.data.ir.AdhocRun
The run to re-rank
Document Index
- XPM Config datamaestro_text.data.ir.DocumentStore(*, id, count)
Bases:
Documents
A document store
A document store can match external and internal IDs, return the document content, and return the number of documents.
- id: str
The unique dataset ID
- count: int
Number of documents
- docid_internal2external(docid: int)
Converts an internal collection ID (integer) to an external ID
- property documentcount
Returns the number of documents in the store
- XPM Config datamaestro_text.data.ir.AdhocIndex(*, id, count)
Bases:
DocumentStore
An index can be used to retrieve documents based on terms
- id: str
The unique dataset ID
- count: int
Number of documents
- term_df(term: str)
Returns the document frequency
- property termcount
Returns the number of terms in the index
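The store and index operations listed above can be illustrated with a toy in-memory class. ToyIndex is a hypothetical stand-in, not part of datamaestro_text; whitespace tokenization, lowercasing, and reading termcount as the number of distinct terms are all assumptions of this sketch.

```python
from collections import Counter
from typing import Dict, List


class ToyIndex:
    """Toy index over {external_id: text}; internal IDs are list positions."""

    def __init__(self, docs: Dict[str, str]):
        self._external_ids: List[str] = list(docs)
        self._df: Counter = Counter()
        for text in docs.values():
            # Count each term once per document to get document frequencies
            self._df.update(set(text.lower().split()))

    @property
    def documentcount(self) -> int:
        return len(self._external_ids)

    @property
    def termcount(self) -> int:
        # Assumption: number of distinct terms
        return len(self._df)

    def docid_internal2external(self, docid: int) -> str:
        return self._external_ids[docid]

    def term_df(self, term: str) -> int:
        # Counter returns 0 for terms that never occur
        return self._df[term]


index = ToyIndex({"doc-a": "the cat sat", "doc-b": "the dog sat down"})
print(index.documentcount, index.termcount)
print(index.docid_internal2external(1), index.term_df("the"))
```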
Training triplets
- XPM Config datamaestro_text.data.ir.TrainingTriplets(*, id)
Bases:
Base
Triplet for training IR systems: query / query ID, positive document, negative document
- id: str
The unique dataset ID
- XPM Config datamaestro_text.data.ir.PairwiseSampleDataset(*, id)
Bases:
Base
Datasets where each record is a query with positive and negative samples
- id: str
The unique dataset ID
- XPM Config datamaestro_text.data.ir.TrainingTripletsLines(*, id, sep, path, doc_ids, topic_ids)
Bases:
TrainingTriplets
Training triplets with one line per triple (query texts)
- id: str
The unique dataset ID
- sep: str
- path: Path
- doc_ids: bool
True if we have document IDs
- topic_ids: bool
True if we have query IDs
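A one-line-per-triple file can be read along the following lines. This is a sketch under assumptions: the column order (query, positive, negative) follows the description of training triplets above, while the Triplet tuple, the iter_triplets helper, and the tab default are hypothetical names chosen for the example.

```python
from typing import Iterable, Iterator, NamedTuple


class Triplet(NamedTuple):
    """Hypothetical record: each field is a text or an ID, depending on the file."""

    query: str
    positive: str
    negative: str


def iter_triplets(lines: Iterable[str], sep: str = "\t") -> Iterator[Triplet]:
    """Yield one Triplet per non-empty line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed column order: query, positive document, negative document
        query, positive, negative = line.split(sep)
        yield Triplet(query, positive, negative)


sample = [
    "q1\td3\td7\n",
    "q2\td1\td9\n",
]
triplets = list(iter_triplets(sample))
print(triplets[0].query, triplets[0].positive, triplets[0].negative)
```

Flags such as doc_ids and topic_ids would then tell the consumer whether each field should be looked up in a collection or used directly as text.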
- XPM Config datamaestro_text.data.ir.csv.TrainingTriplets(*, id, path, separator)
Bases:
TrainingTriplets
Training triplets (full text)
- id: str
The unique dataset ID
- path: Path
- separator: str
- ids: bool = True (constant)
- XPM Config datamaestro_text.data.ir.csv.TrainingTripletsID(*, id, sep, path, doc_ids, topic_ids, separator, documents, topics)
Bases:
TrainingTripletsLines
Training triplets (query/document IDs only)
- id: str
The unique dataset ID
- sep: str
- path: Path
- doc_ids: bool
True if we have document IDs
- topic_ids: bool
True if we have query IDs
- separator: str
Field separator
- documents: datamaestro_text.data.ir.Documents
The documents
- topics: datamaestro_text.data.ir.Topics
The topics
- ids: bool = True (constant)
Whether documents are IDs or full text
- XPM Config datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset(*, id, repo_id, data_files, split, ids, query_id, pos_id, neg_id)
Bases:
HuggingFaceDataset, PairwiseSampleDataset
Triplet for training IR systems: query / query ID, positive document, negative document
- id: str
The unique dataset ID
- repo_id: str
- data_files: str
- split: str
- ids: bool
True if the triplet is made of IDs, False otherwise
- query_id: str = qid
The name of the field containing the query ID
- pos_id: str = pos
The name of the field containing the positive samples
- neg_id: str = neg
The name of the field containing the negative samples
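The role of the configurable field names can be sketched with plain dictionaries standing in for dataset records. The iter_samples helper and the record layout are assumptions for illustration; only the default field names (qid, pos, neg) come from the parameters above, and no Hugging Face API is used here.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def iter_samples(
    records: Iterable[Dict[str, Any]],
    query_id: str = "qid",
    pos_id: str = "pos",
    neg_id: str = "neg",
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (query, positives, negatives) using configurable field names."""
    for record in records:
        yield record[query_id], list(record[pos_id]), list(record[neg_id])


# Plain dictionaries standing in for rows of a dataset split
records = [
    {"qid": "q1", "pos": ["d3"], "neg": ["d7", "d9"]},
    {"qid": "q2", "pos": ["d1", "d2"], "neg": ["d5"]},
]
samples = list(iter_samples(records))
for query, positives, negatives in samples:
    print(query, positives, negatives)
```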