Information Retrieval API

Data objects

class datamaestro_text.data.ir.base.BaseHolder

Bases: object

Base class for topics and documents

class datamaestro_text.data.ir.base.Document

Bases: BaseHolder

Base class for documents

class datamaestro_text.data.ir.base.FullGenericDocument(internal_id: int, id: str, text: str)

Bases: TextHolder, IDHolder, InternalIDHolder, Document

Documents with internal ID, external ID and text

class datamaestro_text.data.ir.base.FullIDDocument(id: str, internal_id: int)

Bases: InternalIDHolder, IDHolder, Document

Documents with internal and external ID

class datamaestro_text.data.ir.base.GenericDocument(id: str, text: str)

Bases: TextHolder, IDHolder, Document

Documents with ID and text

class datamaestro_text.data.ir.base.GenericTopic(id: str, text: str)

Bases: TextHolder, IDHolder, Topic

class datamaestro_text.data.ir.base.IDDocument(id: str)

Bases: IDHolder, Document

Documents with ID

class datamaestro_text.data.ir.base.IDHolder(id: str)

Bases: BaseHolder

Base data class for ID only data structures

class datamaestro_text.data.ir.base.IDTopic(id: str)

Bases: IDHolder, Topic

class datamaestro_text.data.ir.base.InternalIDHolder(internal_id: int)

Bases: BaseHolder

Base data class for internal ID only data structures

class datamaestro_text.data.ir.base.TextDocument(text: str)

Bases: TextHolder, Document

Documents with text

class datamaestro_text.data.ir.base.TextHolder(text: str)

Bases: BaseHolder

Base data class for text only data structures

class datamaestro_text.data.ir.base.TextTopic(text: str)

Bases: TextHolder, Topic

class datamaestro_text.data.ir.base.Topic

Bases: BaseHolder
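The constructor signatures listed above are consistent with the holders being combined as dataclass mixins, where fields are collected from the bases in reverse method-resolution order. The sketch below uses local stand-in classes (nothing is imported from datamaestro_text, and the use of dataclasses is an assumption, not something this page confirms) to show how the order of the bases leads to the argument order (internal_id, id, text).

```python
from dataclasses import dataclass, fields


# Local stand-ins that mirror the names on this page; the real classes
# live in datamaestro_text.data.ir.base and may be implemented differently.
@dataclass
class BaseHolder:
    pass


@dataclass
class Document(BaseHolder):
    pass


@dataclass
class IDHolder(BaseHolder):
    id: str


@dataclass
class InternalIDHolder(BaseHolder):
    internal_id: int


@dataclass
class TextHolder(BaseHolder):
    text: str


@dataclass
class FullGenericDocument(TextHolder, IDHolder, InternalIDHolder, Document):
    pass


# Fields are gathered from the last base to the first, hence this order
field_names = [f.name for f in fields(FullGenericDocument)]
doc = FullGenericDocument(7, "doc-7", "some text")
```

With this layout, a class such as FullIDDocument (bases InternalIDHolder, IDHolder, Document) would take id first and internal_id second, which matches the signature shown above.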

Collection

XPM Config datamaestro_text.data.ir.Adhoc(*, id, documents, topics, assessments)

Bases: Base

An Adhoc IR collection with documents, topics and their assessments

id: str

The unique dataset ID

documents: datamaestro_text.data.ir.Documents

The set of documents

topics: datamaestro_text.data.ir.Topics

The set of topics

assessments: datamaestro_text.data.ir.AdhocAssessments

The set of assessments (for each topic)

Topics

XPM Config datamaestro_text.data.ir.Topics(*, id)

Bases: Base

A set of topics with associated IDs

id: str

The unique dataset ID

count() -> int | None

Returns the number of topics if known

iter() -> Iterator[Topic]

Returns an iterator over topics

XPM Config datamaestro_text.data.ir.csv.Topics(*, id, separator, path)

Bases: Topics

Pairs of (query ID, query text), one per line, with the two fields split by a separator

id: str

The unique dataset ID

separator: str

path: Path

class datamaestro_text.data.ir.Topic

Bases: BaseHolder
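To illustrate the csv.Topics layout (one query ID and one query per line, split by a separator), here is a minimal standalone parser. It does not use datamaestro_text; the function name read_topics and the tab separator in the usage lines are hypothetical choices for this sketch.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_topics(stream: TextIO, separator: str) -> Iterator[Tuple[str, str]]:
    """Yield (query_id, query_text) pairs, one per non-empty line."""
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split once so the separator may also occur inside the query text
        query_id, text = line.split(separator, 1)
        yield query_id, text


# Made-up sample data, only for illustration
sample = io.StringIO("q1\thow do vaccines work\nq2\tcapital of France\n")
topics = list(read_topics(sample, "\t"))
```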

Documents

class datamaestro_text.data.ir.Document

Bases: BaseHolder

Base class for documents

XPM Config datamaestro_text.data.ir.Documents(*, id, count)

Bases: Base

A set of documents with identifiers

See IR Datasets for the list of query classes

id: str

The unique dataset ID

count: int

Number of documents

property documentcount

Returns the number of documents

iter_ids() -> Iterator[str]

Iterates over document ids

The default implementation relies on iter_documents, which is inefficient because every document has to be read just to obtain its ID.
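The cost of the default iter_ids can be pictured with a small stand-in. The classes below are hypothetical (they are not the library's code): the base class derives IDs from iter_documents, while a subclass that already holds the IDs overrides iter_ids to avoid loading documents.

```python
from typing import Iterator, List, NamedTuple


class Doc(NamedTuple):
    # Hypothetical document record with an external ID and some text
    id: str
    text: str


class DocumentsSketch:
    """Stand-in for a Documents-like class (not the library's code)."""

    def __init__(self, docs: List[Doc]):
        self._docs = docs

    def iter_documents(self) -> Iterator[Doc]:
        return iter(self._docs)

    def iter_ids(self) -> Iterator[str]:
        # Generic fallback: goes through every document just to read its ID
        for doc in self.iter_documents():
            yield doc.id


class IDListDocuments(DocumentsSketch):
    """Subclass that keeps the IDs aside and skips document loading."""

    def __init__(self, docs: List[Doc]):
        super().__init__(docs)
        self._ids = [doc.id for doc in docs]

    def iter_ids(self) -> Iterator[str]:
        return iter(self._ids)


docs = [Doc("d1", "first"), Doc("d2", "second")]
default_ids = list(DocumentsSketch(docs).iter_ids())
fast_ids = list(IDListDocuments(docs).iter_ids())
```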

XPM Config datamaestro_text.data.ir.cord19.Documents(*, id, path, delimiter, ignore, names_row, count)

Bases: Documents, Generic

id: str

The unique dataset ID

path: Path

The path of the file

delimiter: str = ,

ignore: int = 0

names_row: int = -1

count: int

Number of documents

XPM Config datamaestro_text.data.ir.csv.Documents(*, id, count, path, separator)

Bases: Documents

One line per document, format pid<SEP>text

id: str

The unique dataset ID

count: int

Number of documents

path: Path

separator: str

Assessments

XPM Config datamaestro_text.data.ir.AdhocAssessments(*, id)

Bases: Base, ABC

Ad-hoc assessments (qrels)

id: str

The unique dataset ID

iter() -> Iterator[AdhocAssessment]

Returns an iterator over assessments

XPM Config datamaestro_text.data.ir.trec.TrecAdhocAssessments

Bases: AdhocAssessments

id: str

The unique dataset ID

class datamaestro_text.data.ir.AdhocAssessment(doc_id: str)

Bases: object
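For readers unfamiliar with qrels, the usual TREC layout has four whitespace-separated columns per line: topic ID, iteration, document ID, relevance. The sketch below parses that layout into nested dictionaries. It is a standalone illustration, not the reader used by TrecAdhocAssessments; the name read_qrels and the sample IDs are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable


def read_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map each topic ID to a {doc_id: relevance} dictionary."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic_id, _iteration, doc_id, relevance = parts
        qrels[topic_id][doc_id] = int(relevance)
    return dict(qrels)


# Made-up judgments, only for illustration
sample = ["t1 0 docA 1", "t1 0 docB 0", "", "t2 0 docC 2"]
qrels = read_qrels(sample)
```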

Runs

XPM Config datamaestro_text.data.ir.AdhocRun(*, id)

Bases: Base

IR adhoc run

id: str

The unique dataset ID

XPM Config datamaestro_text.data.ir.csv.AdhocRunWithText(*, id, separator, path)

Bases: AdhocRun

(qid, doc.id, query, passage)

id: str

The unique dataset ID

separator: str

path: Path

XPM Config datamaestro_text.data.ir.trec.TrecAdhocRun(*, id, path)

Bases: AdhocRun

id: str

The unique dataset ID

path: Path

Results

XPM Config datamaestro_text.data.ir.trec.TrecAdhocResults(*, id, metrics, results, detailed)

Bases: AdhocResults

Adhoc results (TREC format)

id: str

The unique dataset ID

metrics: List[datamaestro_text.data.ir.Measure]

List of reported metrics

results: Path

Main results

detailed: Path

Results per topic (if any)

get_results() -> Dict[str, float]

Returns the results as a dictionary {metric_name: value}
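The shape of the returned dictionary can be illustrated with a small parser for trec_eval-style summary lines (metric, topic, value, where the topic "all" marks the aggregate). This is a sketch under the assumption that the results file follows that layout; it is not the implementation of get_results, and parse_trec_eval is a hypothetical name.

```python
from typing import Dict, Iterable


def parse_trec_eval(lines: Iterable[str]) -> Dict[str, float]:
    """Collect aggregate metric values as {metric_name: value}."""
    results = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        metric, topic, value = parts
        if topic != "all":
            continue  # per-topic lines are ignored here
        try:
            results[metric] = float(value)
        except ValueError:
            # Non-numeric summary fields such as runid are skipped
            continue
    return results


# Made-up values, only for illustration
sample = [
    "runid\tall\tmyrun",
    "map\tall\t0.2500",
    "P_10\tall\t0.4000",
    "map\tt1\t0.1000",
]
summary = parse_trec_eval(sample)
```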

Reranking

XPM Config datamaestro_text.data.ir.RerankAdhoc(*, id, documents, topics, assessments, run)

Bases: Adhoc

Re-ranking ad-hoc task based on an existing run

id: str

The unique dataset ID

documents: datamaestro_text.data.ir.Documents

The set of documents

topics: datamaestro_text.data.ir.Topics

The set of topics

assessments: datamaestro_text.data.ir.AdhocAssessments

The set of assessments (for each topic)

run: datamaestro_text.data.ir.AdhocRun

The run to re-rank
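Conceptually, a re-ranking task takes the documents already retrieved for each topic in the run and reorders them with a new scoring function. The sketch below shows that idea on plain dictionaries. The representation of a run as {topic_id: [doc_id, ...]} and the rerank helper are assumptions made for illustration, not the library's data model.

```python
from typing import Callable, Dict, List


def rerank(
    run: Dict[str, List[str]], score: Callable[[str, str], float]
) -> Dict[str, List[str]]:
    """Reorder each topic's candidates by decreasing score(topic_id, doc_id)."""
    return {
        topic_id: sorted(doc_ids, key=lambda d: score(topic_id, d), reverse=True)
        for topic_id, doc_ids in run.items()
    }


# Made-up run and scores, only for illustration
run = {"q1": ["d1", "d2", "d3"]}
toy_scores = {("q1", "d1"): 0.1, ("q1", "d2"): 0.9, ("q1", "d3"): 0.5}
reranked = rerank(run, lambda q, d: toy_scores[(q, d)])
```

Only documents present in the original run are considered, which is what distinguishes re-ranking from retrieval over the whole collection.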

Document Index

XPM Config datamaestro_text.data.ir.DocumentStore(*, id, count)

Bases: Documents

A document store

A document store can match external and internal IDs, return the document content, and return the number of documents.

id: str

The unique dataset ID

count: int

Number of documents

docid_internal2external(docid: int)

Converts an internal collection ID (integer) to an external ID

property documentcount

Returns the number of documents

iter_sample(randint: Callable[[int], int] | None) -> Iterator[Document]

Sample documents from the dataset
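One plausible reading of the randint argument is a function that, given n, returns an integer in [0, n), so that callers can plug in a seeded generator. The stand-in store below follows that reading; it is an assumption, not documented behavior. TinyStore is hypothetical and yields external IDs rather than Document objects to keep the sketch short.

```python
import random
from itertools import islice
from typing import Callable, Iterator, List, Optional


class TinyStore:
    """Stand-in store mapping internal integer IDs to external string IDs."""

    def __init__(self, external_ids: List[str]):
        self._external_ids = external_ids

    @property
    def documentcount(self) -> int:
        return len(self._external_ids)

    def docid_internal2external(self, docid: int) -> str:
        return self._external_ids[docid]

    def iter_sample(
        self, randint: Optional[Callable[[int], int]] = None
    ) -> Iterator[str]:
        # Assumption: randint(n) returns an integer in [0, n)
        if randint is None:
            randint = random.Random().randrange
        while True:
            yield self.docid_internal2external(randint(self.documentcount))


store = TinyStore(["doc-a", "doc-b", "doc-c"])
rng = random.Random(0)
# The iterator is endless, so take a bounded number of samples
sampled = list(islice(store.iter_sample(rng.randrange), 5))
fixed = list(islice(store.iter_sample(lambda n: n - 1), 2))
```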

XPM Config datamaestro_text.data.ir.AdhocIndex(*, id, count)

Bases: DocumentStore

An index can be used to retrieve documents based on terms

id: str

The unique dataset ID

count: int

Number of documents

term_df(term: str)

Returns the document frequency

property termcount

Returns the number of terms in the index
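To make term_df and termcount concrete: the document frequency of a term is the number of documents that contain it at least once, and the term count is the size of the vocabulary. The in-memory index below is a hypothetical stand-in (whitespace tokenization and lowercasing are assumptions of this sketch), not the AdhocIndex implementation.

```python
from collections import Counter
from typing import Dict


class TinyIndex:
    """Stand-in index that only tracks document frequencies."""

    def __init__(self, docs: Dict[str, str]):
        self._df = Counter()
        for text in docs.values():
            # A term counts once per document, however often it occurs
            self._df.update(set(text.lower().split()))

    def term_df(self, term: str) -> int:
        return self._df[term]

    @property
    def termcount(self) -> int:
        return len(self._df)


# Made-up documents, only for illustration
index = TinyIndex(
    {
        "d1": "neural ranking models",
        "d2": "neural neural retrieval",
        "d3": "classic retrieval",
    }
)
neural_df = index.term_df("neural")
```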

Training triplets

XPM Config datamaestro_text.data.ir.TrainingTriplets(*, id)

Bases: Base

Triplets for training IR systems: query (or query ID), positive document, negative document

id: str

The unique dataset ID

iter() -> Iterator[Tuple[Topic, Document, Document]]

Returns an iterator

XPM Config datamaestro_text.data.ir.PairwiseSampleDataset(*, id)

Bases: Base

Datasets where each record is a query with positive and negative samples

id: str

The unique dataset ID

XPM Config datamaestro_text.data.ir.TrainingTripletsLines(*, id, sep, path, doc_ids, topic_ids)

Bases: TrainingTriplets

Training triplets with one line per triple (query texts)

id: str

The unique dataset ID

sep: str

path: Path

doc_ids: bool

True if we have document IDs

topic_ids: bool

True if we have query IDs

XPM Config datamaestro_text.data.ir.csv.TrainingTriplets(*, id, path, separator)

Bases: TrainingTriplets

Training triplets (full text)

id: str

The unique dataset ID

path: Path

separator: str

ids: bool = True (constant)

XPM Config datamaestro_text.data.ir.csv.TrainingTripletsID(*, id, sep, path, doc_ids, topic_ids, separator, documents, topics)

Bases: TrainingTripletsLines

Training triplets (query/document IDs only)

id: str

The unique dataset ID

sep: str

path: Path

doc_ids: bool

True if we have document IDs

topic_ids: bool

True if we have query IDs

separator: str

Field separator

documents: datamaestro_text.data.ir.Documents

The documents

topics: datamaestro_text.data.ir.Topics

The topics

ids: bool = True (constant)

Whether documents are IDs or full text

XPM Config datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset(*, id, repo_id, data_files, split, ids, query_id, pos_id, neg_id)

Bases: HuggingFaceDataset, PairwiseSampleDataset

Triplet for training IR systems: query / query ID, positive document, negative document

id: str

The unique dataset ID

repo_id: str

data_files: str

split: str

ids: bool

True if the triplet is made of IDs, False otherwise

query_id: str = qid

The name of the field containing the query ID

pos_id: str = pos

The name of the field containing the positive samples

neg_id: str = neg

The name of the field containing the negative samples
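The query_id, pos_id and neg_id parameters name the fields to read from each record. The sketch below expands one such record into (query, positive, negative) triplets. It works on a plain dictionary and assumes the positive and negative fields hold lists; neither the record shape nor the iter_triplets helper comes from the library.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_triplets(
    record: Dict[str, Any],
    query_id: str = "qid",
    pos_id: str = "pos",
    neg_id: str = "neg",
) -> Iterator[Tuple[Any, Any, Any]]:
    """Yield every (query, positive, negative) combination of one record."""
    query = record[query_id]
    for positive in record[pos_id]:
        for negative in record[neg_id]:
            yield query, positive, negative


# Made-up record, only for illustration
record = {"qid": "q1", "pos": ["d3"], "neg": ["d7", "d9"]}
triplets = list(iter_triplets(record))
```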