Information Retrieval API

Data objects

class datamaestro_text.data.ir.base.BaseRecord(*items: Dict[Type[T], T] | T, no_check=False)

Bases: Record

class datamaestro_text.data.ir.base.DocumentRecord(*items: Dict[Type[T], T] | T, no_check=False)

Bases: BaseRecord

Document record

class datamaestro_text.data.ir.base.GenericDocumentRecord(*items: Dict[Type[T], T] | T, no_check=False)

Bases: DocumentRecord

itemtypes: ClassVar[List[Type[T]]] = [<class 'datamaestro_text.data.ir.base.IDItem'>, <class 'datamaestro_text.data.ir.base.TextItem'>]: For specific records, this is the list of types

class datamaestro_text.data.ir.base.GenericTopicRecord(*items: Dict[Type[T], T] | T, no_check=False)

Bases: TopicRecord

itemtypes: ClassVar[List[Type[T]]] = [<class 'datamaestro_text.data.ir.base.IDItem'>, <class 'datamaestro_text.data.ir.base.TextItem'>]: For specific records, this is the list of types

class datamaestro_text.data.ir.base.IDDocumentRecord(*items: Dict[Type[T], T] | T, no_check=False)

Bases: DocumentRecord

itemtypes: ClassVar[List[Type[T]]] = [<class 'datamaestro_text.data.ir.base.IDItem'>]: For specific records, this is the list of types

class datamaestro_text.data.ir.base.IDItem(id: str)

Bases: Item, ABC

A topic/document with an external ID

class datamaestro_text.data.ir.base.IDTopicRecord(*items: Dict[Type[T], T] | T, no_check=False)

Bases: TopicRecord

itemtypes: ClassVar[List[Type[T]]] = [<class 'datamaestro_text.data.ir.base.IDItem'>]: For specific records, this is the list of types

class datamaestro_text.data.ir.base.InternalIDItem(id: int)

Bases: Item, ABC

A topic/document with an internal ID

class datamaestro_text.data.ir.base.ScoredItem(score: float)

Bases: Item

A score associated with the document

score: float: A retrieval score associated with this record (e.g. of the first-stage retriever)

class datamaestro_text.data.ir.base.SimpleTextDocumentRecord(*items: Dict[Type[T], T] | T, no_check=False)

Bases: DocumentRecord

itemtypes: ClassVar[List[Type[T]]] = [<class 'datamaestro_text.data.ir.base.SimpleTextItem'>]: For specific records, this is the list of types

class datamaestro_text.data.ir.base.SimpleTextItem(text: str)

Bases: TextItem

A topic/document with a text record

class datamaestro_text.data.ir.base.SimpleTextTopicRecord(*items: Dict[Type[T], T] | T, no_check=False)

Bases: TopicRecord

itemtypes: ClassVar[List[Type[T]]] = [<class 'datamaestro_text.data.ir.base.SimpleTextItem'>]: For specific records, this is the list of types

class datamaestro_text.data.ir.base.TextItem

Bases: Item, ABC

abstract property text: str: Returns the text

class datamaestro_text.data.ir.base.TopicRecord(*items: Dict[Type[T], T] | T, no_check=False)

Bases: BaseRecord

Topic record
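The record classes above are built from items, and `itemtypes` lists the item types a specific record expects. The following is a minimal sketch of that idea, not the library implementation: `ToyIDItem`, `ToyTextItem` and `ToyRecord` are hypothetical stand-ins, and the lookup by item type is an assumption based on the `Dict[Type[T], T]` constructor signature.

```python
from dataclasses import dataclass
from typing import Type, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class ToyIDItem:
    """Hypothetical item holding an external ID."""
    id: str


@dataclass(frozen=True)
class ToyTextItem:
    """Hypothetical item holding a text."""
    text: str


class ToyRecord:
    """Hypothetical record: stores one item per item type."""

    # Item types this record requires (mirrors the idea of `itemtypes`)
    itemtypes = [ToyIDItem, ToyTextItem]

    def __init__(self, *items):
        # Index the items by their concrete type
        self._items = {type(item): item for item in items}
        missing = [t.__name__ for t in self.itemtypes if t not in self._items]
        if missing:
            raise ValueError(f"missing item types: {missing}")

    def __getitem__(self, item_type: Type[T]) -> T:
        # Retrieve the item of the requested type
        return self._items[item_type]


record = ToyRecord(ToyIDItem("d1"), ToyTextItem("some text"))
doc_id = record[ToyIDItem].id
doc_text = record[ToyTextItem].text
```

Keying items by type lets code ask for exactly the piece of information it needs (an ID, a text) without depending on the concrete record class.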

Collection

XPM Config datamaestro_text.data.ir.Adhoc(*, id, documents, topics, assessments)

Bases: Base

Submit type: datamaestro_text.data.ir.Adhoc

An Adhoc IR collection with documents, topics and their assessments

id: str: The unique dataset ID

documents: datamaestro_text.data.ir.Documents: The set of documents

topics: datamaestro_text.data.ir.Topics: The set of topics

assessments: datamaestro_text.data.ir.AdhocAssessments: The set of assessments (for each topic)

Topics

XPM Config datamaestro_text.data.ir.Topics(*, id)

Bases: Base, ABC

Submit type: datamaestro_text.data.ir.Topics

A set of topics with associated IDs

id: str: The unique dataset ID

count() → int | None: Returns the number of topics if known

abstract iter() → Iterator[TopicRecord]: Returns an iterator over topics

XPM Config datamaestro_text.data.ir.csv.Topics(*, id, separator, path)

Bases: Topics

Submit type: datamaestro_text.data.ir.csv.Topics

Pairs of query ID and query text, split by a separator

id: str: The unique dataset ID

separator: str

path: Path
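To illustrate the "query ID, separator, query" layout, here is a minimal standard-library sketch that parses such lines. The helper name `read_topic_pairs`, the tab separator and the one-pair-per-line layout are assumptions made for this example; they are not part of datamaestro_text.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_topic_pairs(stream: TextIO, separator: str = "\t") -> Iterator[Tuple[str, str]]:
    """Yield (query_id, query_text) pairs from separator-delimited lines.

    Hypothetical helper: it mirrors the file layout, not the library code.
    """
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first separator only, so the query text may contain it
        query_id, query_text = line.split(separator, 1)
        yield query_id, query_text


sample = io.StringIO("q1\twhat is information retrieval\nq2\tneural ranking models\n")
pairs = list(read_topic_pairs(sample))
```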

XPM Config datamaestro_text.transforms.ir.TopicWrapper

Bases: Config, ABC

Submit type: datamaestro_text.transforms.ir.TopicWrapper

Modify topics on the fly using a topic wrapper

XPM Config datamaestro_text.data.ir.trec.TrecTopics(*, id, path, parts)

Bases: Topics

Submit type: datamaestro_text.data.ir.trec.TrecTopics

id: str: The unique dataset ID

path: Path

parts: List[str]

Documents

XPM Config datamaestro_text.data.ir.Documents(*, id, count)

Bases: Base

Submit type: datamaestro_text.data.ir.Documents

A set of documents with identifiers

See IR Datasets for the list of query classes

id: str: The unique dataset ID

count: int: Number of documents

property documentcount: Returns the number of documents in the collection

iter_ids() → Iterator[str]

Iterates over document ids

By default, this uses iter_documents, which is inefficient because every document is read just to obtain its ID.
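The fallback behaviour of `iter_ids` can be pictured with the sketch below. `ToyDocuments` and `ToyIndexedDocuments` are hypothetical classes written for this example; only the method names `iter_ids` and `iter_documents` come from the reference, and the (ID, text) tuple layout is an assumption.

```python
from typing import Iterator, List, Tuple


class ToyDocuments:
    """Hypothetical stand-in for a document collection (not the library class)."""

    def __init__(self, docs: List[Tuple[str, str]]):
        self._docs = docs

    def iter_documents(self) -> Iterator[Tuple[str, str]]:
        # Each document is produced in full, including its text
        yield from self._docs

    def iter_ids(self) -> Iterator[str]:
        # Fallback: go through whole documents just to extract their IDs
        for doc_id, _text in self.iter_documents():
            yield doc_id


class ToyIndexedDocuments(ToyDocuments):
    """Subclass overriding iter_ids with an ID-only lookup."""

    def __init__(self, docs: List[Tuple[str, str]]):
        super().__init__(docs)
        self._ids = [doc_id for doc_id, _ in docs]

    def iter_ids(self) -> Iterator[str]:
        yield from self._ids  # no document text is touched


docs = [("d1", "first text"), ("d2", "second text")]
fallback_ids = list(ToyDocuments(docs).iter_ids())
fast_ids = list(ToyIndexedDocuments(docs).iter_ids())
```

A collection that can list its IDs cheaply (for instance from an index) would typically override the fallback in this way.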

XPM Config datamaestro_text.data.ir.cord19.Documents(*, id, path, delimiter, ignore, names_row, count)

Bases: Documents, Generic

Submit type: datamaestro_text.data.ir.cord19.Documents

id: str: The unique dataset ID

path: Path: The path of the file

delimiter: str = ,

ignore: int = 0

names_row: int = -1

count: int: Number of documents

XPM Config datamaestro_text.data.ir.trec.TipsterCollection(*, id, count, path)

Bases: Documents

Submit type: datamaestro_text.data.ir.trec.TipsterCollection

id: str: The unique dataset ID

count: int: Number of documents

path: Path

Assessments

XPM Config datamaestro_text.data.ir.AdhocAssessments(*, id)

Bases: Base, ABC

Submit type: datamaestro_text.data.ir.AdhocAssessments

Ad-hoc assessments (qrels)

id: str: The unique dataset ID

id: str: The unique dataset ID

iter() → Iterator[AdhocAssessment]: Returns an iterator over assessments

XPM Config datamaestro_text.data.ir.trec.TrecAdhocAssessments

Bases: AdhocAssessments

Submit type: datamaestro_text.data.ir.trec.TrecAdhocAssessments

id: str: The unique dataset ID

class datamaestro_text.data.ir.AdhocAssessment(doc_id: str)

Bases: object
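For readers unfamiliar with qrels, the sketch below parses the usual TREC qrels layout (topic ID, iteration, document ID, relevance) with the standard library. The helper `read_qrels` and the nested-dictionary result are assumptions for illustration; the library exposes assessments through its own classes.

```python
import io
from collections import defaultdict
from typing import Dict, TextIO


def read_qrels(stream: TextIO) -> Dict[str, Dict[str, int]]:
    """Parse qrels lines of the form: topic_id iteration doc_id relevance.

    Hypothetical helper returning {topic_id: {doc_id: relevance}}.
    """
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in stream:
        fields = line.split()
        if len(fields) != 4:
            continue  # ignore blank or malformed lines
        topic_id, _iteration, doc_id, relevance = fields
        qrels[topic_id][doc_id] = int(relevance)
    return dict(qrels)


sample = io.StringIO("301 0 FBIS3-1 1\n301 0 FBIS3-2 0\n302 0 FT911-3 2\n")
qrels = read_qrels(sample)
```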

Runs

XPM Config datamaestro_text.data.ir.AdhocRun(*, id)

Bases: Base

Submit type: datamaestro_text.data.ir.AdhocRun

IR adhoc run

id: str: The unique dataset ID

XPM Config datamaestro_text.data.ir.csv.AdhocRunWithText(*, id, separator, path)

Bases: AdhocRun

Submit type: datamaestro_text.data.ir.csv.AdhocRunWithText

(qid, doc.id, query, passage)

id: str: The unique dataset ID

separator: str

path: Path

XPM Config datamaestro_text.data.ir.trec.TrecAdhocRun(*, id, path)

Bases: AdhocRun

Submit type: datamaestro_text.data.ir.trec.TrecAdhocRun

id: str: The unique dataset ID

path: Path
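A TREC run file conventionally holds one line per retrieved document: query ID, the literal `Q0`, document ID, rank, score and run tag. The sketch below reads that layout with the standard library; `read_trec_run` is a hypothetical helper, and it is an assumption that the file behind `path` follows this convention.

```python
import io
from typing import Dict, List, TextIO, Tuple


def read_trec_run(stream: TextIO) -> Dict[str, List[Tuple[str, float]]]:
    """Parse run lines of the form: qid Q0 doc_id rank score tag.

    Hypothetical helper returning {qid: [(doc_id, score), ...]},
    with each ranking sorted by decreasing score.
    """
    run: Dict[str, List[Tuple[str, float]]] = {}
    for line in stream:
        fields = line.split()
        if len(fields) != 6:
            continue  # ignore blank or malformed lines
        qid, _q0, doc_id, _rank, score, _tag = fields
        run.setdefault(qid, []).append((doc_id, float(score)))
    for ranking in run.values():
        ranking.sort(key=lambda pair: pair[1], reverse=True)
    return run


sample = io.StringIO(
    "q1 Q0 d7 2 9.1 bm25\n"
    "q1 Q0 d3 1 12.5 bm25\n"
    "q2 Q0 d1 1 4.0 bm25\n"
)
run = read_trec_run(sample)
```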

Results

XPM Config datamaestro_text.data.ir.trec.TrecAdhocResults(*, id, metrics, results, detailed)

Bases: AdhocResults

Submit type: datamaestro_text.data.ir.trec.TrecAdhocResults

Adhoc results (TREC format)

id: str: The unique dataset ID

metrics: List[datamaestro_text.data.ir.Measure]: List of reported metrics

results: Path: Main results

detailed: Path: Results per topic (if any)

get_results() → Dict[str, float]: Returns the results as a dictionary {metric_name: value}
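The `{metric_name: value}` dictionary can be pictured with the sketch below, which parses summary lines in the style printed by `trec_eval` (metric name, the word `all`, value). That the results file uses this layout is an assumption, and `parse_trec_eval` is a hypothetical helper, not the code behind `get_results`.

```python
import io
from typing import Dict, TextIO


def parse_trec_eval(stream: TextIO) -> Dict[str, float]:
    """Collect {metric_name: value} from lines such as 'map  all  0.2500'."""
    results: Dict[str, float] = {}
    for line in stream:
        fields = line.split()
        # Keep only aggregate lines: <metric> all <value>
        if len(fields) != 3 or fields[1] != "all":
            continue
        try:
            results[fields[0]] = float(fields[2])
        except ValueError:
            continue  # non-numeric entries such as the run identifier
    return results


sample = io.StringIO(
    "runid                 \tall\tbm25\n"
    "map                   \tall\t0.2500\n"
    "P_10                  \tall\t0.4000\n"
)
metrics = parse_trec_eval(sample)
```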

Reranking

XPM Config datamaestro_text.data.ir.RerankAdhoc(*, id, documents, topics, assessments, run)

Bases: Adhoc

Submit type: datamaestro_text.data.ir.RerankAdhoc

Re-ranking ad-hoc task based on an existing run

id: str: The unique dataset ID

documents: datamaestro_text.data.ir.Documents: The set of documents

topics: datamaestro_text.data.ir.Topics: The set of topics

assessments: datamaestro_text.data.ir.AdhocAssessments: The set of assessments (for each topic)

run: datamaestro_text.data.ir.AdhocRun: The run to re-rank
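Conceptually, a re-ranking task takes the candidates listed in an existing run and re-orders them with a new scoring function. The sketch below shows this on plain dictionaries; `rerank` and `overlap_score` are hypothetical names, and the term-overlap scorer is only a placeholder for a real model.

```python
from typing import Callable, Dict, List, Tuple


def rerank(
    run: Dict[str, List[str]],
    topics: Dict[str, str],
    documents: Dict[str, str],
    scorer: Callable[[str, str], float],
) -> Dict[str, List[Tuple[str, float]]]:
    """Re-score the candidates of each query and sort by decreasing score."""
    reranked: Dict[str, List[Tuple[str, float]]] = {}
    for qid, candidates in run.items():
        scored = [(doc_id, scorer(topics[qid], documents[doc_id])) for doc_id in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        reranked[qid] = scored
    return reranked


def overlap_score(query: str, text: str) -> float:
    """Placeholder scorer: number of distinct terms shared by query and text."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


topics = {"q1": "neural ranking"}
documents = {
    "d1": "classic boolean retrieval",
    "d2": "neural ranking with transformers",
    "d3": "ranking functions",
}
run = {"q1": ["d1", "d2", "d3"]}  # candidates from a first-stage retriever
reranked = rerank(run, topics, documents, overlap_score)
```

Only documents already present in the run are scored, which is what distinguishes re-ranking from retrieval over the full collection.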

Document Index

XPM Config datamaestro_text.data.ir.DocumentStore(*, id, count)

Bases: Documents

Submit type: datamaestro_text.data.ir.DocumentStore

A document store

A document store can:

- match external/internal IDs
- return the document content
- return the number of documents

id: str: The unique dataset ID

count: int: Number of documents

docid_internal2external(docid: int): Converts an internal collection ID (integer) to an external ID

property documentcount: Returns the number of documents in the store

iter_sample(randint: Callable[[int], int] | None) → Iterator[DocumentRecord]: Sample documents from the dataset
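The three capabilities of a document store (ID mapping, content lookup, document count) are sketched below with an in-memory toy class. Only the names `docid_internal2external` and `documentcount` come from this reference; `ToyDocumentStore`, `docid_external2internal` and `document_text` are hypothetical and exist for illustration only.

```python
from typing import Dict


class ToyDocumentStore:
    """Hypothetical in-memory store; internal IDs are insertion positions."""

    def __init__(self, docs: Dict[str, str]):
        self._external_ids = list(docs)  # internal ID -> external ID
        self._internal_ids = {ext: i for i, ext in enumerate(self._external_ids)}
        self._texts = list(docs.values())  # internal ID -> text

    def docid_internal2external(self, docid: int) -> str:
        # Internal (integer) ID to external (string) ID
        return self._external_ids[docid]

    def docid_external2internal(self, docid: str) -> int:
        # Hypothetical inverse mapping
        return self._internal_ids[docid]

    def document_text(self, docid: str) -> str:
        # Hypothetical content lookup by external ID
        return self._texts[self._internal_ids[docid]]

    @property
    def documentcount(self) -> int:
        return len(self._external_ids)


store = ToyDocumentStore({"doc-a": "first text", "doc-b": "second text"})
external = store.docid_internal2external(1)
internal = store.docid_external2internal("doc-a")
```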

XPM Config datamaestro_text.data.ir.AdhocIndex(*, id, count)

Bases: DocumentStore

Submit type: datamaestro_text.data.ir.AdhocIndex

An index can be used to retrieve documents based on terms

id: str: The unique dataset ID

count: int: Number of documents

term_df(term: str): Returns the document frequency

property termcount: Returns the number of terms in the index
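As a reminder of what these statistics mean, the toy inverted index below computes a document frequency (the number of documents containing a term) and a term count. `ToyIndex` is a hypothetical class; whitespace tokenization and the reading of `termcount` as the number of distinct terms are assumptions of this sketch.

```python
from collections import defaultdict
from typing import Dict, Set


class ToyIndex:
    """Hypothetical inverted index: term -> set of document IDs."""

    def __init__(self, docs: Dict[str, str]):
        self._postings: Dict[str, Set[str]] = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                self._postings[term].add(doc_id)

    def term_df(self, term: str) -> int:
        # Number of documents containing the term (0 if unseen)
        return len(self._postings.get(term, ()))

    @property
    def termcount(self) -> int:
        # Number of distinct terms (assumed meaning for this sketch)
        return len(self._postings)


index = ToyIndex({"d1": "neural ranking models", "d2": "ranking functions"})
df_ranking = index.term_df("ranking")
df_missing = index.term_df("missing")
```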

Training triplets

XPM Config datamaestro_text.data.ir.TrainingTriplets(*, id)

Bases: Base, ABC

Submit type: datamaestro_text.data.ir.TrainingTriplets

Triplet for training IR systems: query / query ID, positive document, negative document

id: str: The unique dataset ID

iter() → Iterator[Tuple[TopicRecord, DocumentRecord, DocumentRecord]]: Returns an iterator over (topic, positive document, negative document) triplets

XPM Config datamaestro_text.data.ir.PairwiseSampleDataset(*, id)

Bases: Base, ABC

Submit type: datamaestro_text.data.ir.PairwiseSampleDataset

Datasets where each record is a query with positive and negative samples

id: str: The unique dataset ID

XPM Config datamaestro_text.data.ir.TrainingTripletsLines(*, id, sep, path, doc_ids, topic_ids)

Bases: TrainingTriplets

Submit type: datamaestro_text.data.ir.TrainingTripletsLines

Training triplets with one line per triple (query texts)

id: str: The unique dataset ID

sep: str

path: Path

doc_ids: bool: True if we have documents IDs

topic_ids: bool: True if we have query IDs

XPM Config datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset(*, id, repo_id, data_files, split, ids, query_id, pos_id, neg_id)

Bases: HuggingFaceDataset, PairwiseSampleDataset

Submit type: datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset

Triplet for training IR systems: query / query ID, positive document, negative document

id: str: The unique dataset ID

repo_id: str

data_files: str

split: str

ids: bool: True if the triplet is made of IDs, False otherwise

query_id: str = qid: The name of the field containing the query ID

pos_id: str = pos: The name of the field containing the positive samples

neg_id: str = neg: The name of the field containing the negative samples
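The role of the three field names can be illustrated on a plain dictionary standing in for one dataset row. In this sketch, `extract_sample` is a hypothetical helper, and it is an assumption that the positive and negative fields hold lists of samples.

```python
from typing import Any, Dict, List, Tuple


def extract_sample(
    row: Dict[str, Any],
    query_id: str = "qid",
    pos_id: str = "pos",
    neg_id: str = "neg",
) -> Tuple[Any, List[Any], List[Any]]:
    """Return (query, positives, negatives) using configurable field names."""
    return row[query_id], list(row[pos_id]), list(row[neg_id])


# Row using the default field names
default_row = {"qid": "q1", "pos": ["d1", "d2"], "neg": ["d9"]}
sample = extract_sample(default_row)

# Row from a dataset with different column names
custom_row = {"query": "q2", "positives": ["d4"], "negatives": ["d5", "d6"]}
custom_sample = extract_sample(
    custom_row, query_id="query", pos_id="positives", neg_id="negatives"
)
```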

Transforms

XPM Config datamaestro_text.transforms.ir.StoreTrainingTripletTopicAdapter(*, id, store, data)

Bases: TrainingTriplets

Submit type: datamaestro_text.transforms.ir.StoreTrainingTripletTopicAdapter

Retrieve an adhoc topic text from a topic store (given the topic ID)

id: str

store: datamaestro_text.data.ir.TopicsStore: The topic store to use

data: datamaestro_text.data.ir.TrainingTriplets: Input data

XPM Config datamaestro_text.transforms.ir.StoreTrainingTripletDocumentAdapter(*, id, store, data)

Bases: TrainingTriplets

Submit type: datamaestro_text.transforms.ir.StoreTrainingTripletDocumentAdapter

Transforms training triplets to add the document text from a document store

id: str

store: datamaestro_text.data.ir.DocumentStore: The document store to use

data: datamaestro_text.data.ir.TrainingTriplets: Input data

XPM Task datamaestro_text.transforms.ir.ShuffledTrainingTripletsLines(*, data, doc_ids, topic_ids, seed, compressed, sample_rate, sample_max)

Bases: Task

Submit type: Any

Shuffle a set of training triplets

data: datamaestro_text.data.ir.TrainingTriplets: Input data

path: Path (generated): Output path

doc_ids: bool: Whether to use document ids

topic_ids: bool: True if we have query IDs

seed: int: The random seed

compressed: bool = True: Compress the output

sample_rate: float = 1.0: Sampling rate - set to 1 to keep all the samples

sample_max: int = 0: Maximum number of samples

tmp_path: Path (generated): Path where temporary files will be stored
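How `seed`, `sample_rate` and `sample_max` might interact is sketched below on an in-memory list. This is not the task's implementation: `shuffle_and_sample` is a hypothetical function, and it is an assumption that `sample_max = 0` means "no limit" and that sampling is applied before truncation.

```python
import random
from typing import List, Sequence, Tuple

Triplet = Tuple[str, str, str]


def shuffle_and_sample(
    triplets: Sequence[Triplet],
    seed: int,
    sample_rate: float = 1.0,
    sample_max: int = 0,
) -> List[Triplet]:
    """Shuffle triplets reproducibly, optionally keeping only a sample."""
    rng = random.Random(seed)  # dedicated generator, so results depend only on the seed
    # Keep each triplet with probability sample_rate (all of them when rate >= 1)
    kept = [t for t in triplets if sample_rate >= 1.0 or rng.random() < sample_rate]
    rng.shuffle(kept)
    if sample_max > 0:  # assumption: 0 means "no limit"
        kept = kept[:sample_max]
    return kept


triplets = [(f"q{i}", f"pos{i}", f"neg{i}") for i in range(10)]
full = shuffle_and_sample(triplets, seed=42)
again = shuffle_and_sample(triplets, seed=42)
limited = shuffle_and_sample(triplets, seed=42, sample_max=3)
```

Using a seeded generator rather than the global random state keeps the shuffle reproducible across runs, which matters when the output is cached on disk.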