Information Retrieval API

Collection

XPM Config datamaestro_text.data.ir.Adhoc(*, id, documents, topics, assessments)

An ad-hoc IR collection with documents, topics, and their assessments

id: str: The unique dataset ID

documents: datamaestro_text.data.ir.AdhocDocuments: The set of documents

topics: datamaestro_text.data.ir.AdhocTopics: The set of topics

assessments: datamaestro_text.data.ir.AdhocAssessments: The set of assessments (for each topic)

Topics

XPM Config datamaestro_text.data.ir.AdhocTopics(*, id)

id: str: The unique dataset ID

iter() → Iterator[AdhocTopic]: Returns an iterator over topics

XPM Config datamaestro_text.data.ir.csv.AdhocTopics(*, id, separator, path)

Pairs of query ID and query text, delimited by a separator

id: str: The unique dataset ID

separator: str

path: Path

class datamaestro_text.data.ir.AdhocTopic(qid: str, text: str, metadata: Dict[str, str]): The most generic topic: an ID with some text
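
The separator-based topic format described above can be illustrated with a minimal sketch. It does not use the library itself: the `AdhocTopic` dataclass is a stand-in that only mirrors the documented constructor, `iter_topics` is a hypothetical helper, and the assumption of one tab-separated pair per line is ours.

```python
import io
from dataclasses import dataclass, field
from typing import Dict, Iterator, TextIO


@dataclass
class AdhocTopic:
    """Stand-in mirroring the documented AdhocTopic(qid, text, metadata)."""

    qid: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def iter_topics(stream: TextIO, separator: str = "\t") -> Iterator[AdhocTopic]:
    """Yield one topic per non-empty line of the form qid<separator>query.

    Hypothetical helper: the one-pair-per-line layout is an assumption.
    """
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split only once so the query text may itself contain the separator
        qid, text = line.split(separator, 1)
        yield AdhocTopic(qid, text)


sample = io.StringIO("q1\twhat is information retrieval\nq2\tbm25 ranking\n")
topics = list(iter_topics(sample))
```

Splitting only on the first separator keeps queries intact when they happen to contain the separator character.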

Documents

XPM Config datamaestro_text.data.ir.AdhocDocuments(*, id, count)

A set of documents with identifiers

id: str: The unique dataset ID

count: int: Number of documents

XPM Config datamaestro_text.data.ir.cord19.Documents(*, id, path, delimiter, ignore, names_row, count)

id: str: The unique dataset ID

path: Path: The path of the file

delimiter: str = ,

ignore: int = 0

names_row: int = -1

count: int: Number of documents

XPM Config datamaestro_text.data.ir.csv.AdhocDocuments(*, id, count, path, separator)

One line per document, in the format pid<SEP>text

id: str: The unique dataset ID

count: int: Number of documents

path: Path

separator: str

Assessments

XPM Config datamaestro_text.data.ir.AdhocAssessments(*, id)

Ad-hoc assessments (qrels)

id: str: The unique dataset ID

iter() → Iterator[AdhocAssessedTopic]: Returns an iterator over assessments

XPM Config datamaestro_text.data.ir.trec.TrecAdhocAssessments

id: str: The unique dataset ID

class datamaestro_text.data.ir.AdhocAssessedTopic(qid: str, assessments: List[datamaestro_text.data.ir.AdhocAssessment])

class datamaestro_text.data.ir.AdhocAssessment(docno: str, rel: float): An ad-hoc assessment associates a document ID with a relevance value
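
To show how the two assessment classes fit together, here is a minimal sketch that groups qrels lines by topic. The dataclasses are stand-ins mirroring the documented constructors, `parse_qrels` is a hypothetical helper, and the input is assumed to follow the usual four-column TREC qrels layout (topic ID, iteration, document ID, relevance); how `TrecAdhocAssessments` actually reads its file is not described here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class AdhocAssessment:
    """Stand-in mirroring the documented AdhocAssessment(docno, rel)."""

    docno: str
    rel: float


@dataclass
class AdhocAssessedTopic:
    """Stand-in mirroring the documented AdhocAssessedTopic(qid, assessments)."""

    qid: str
    assessments: List[AdhocAssessment]


def parse_qrels(lines: Iterable[str]) -> List[AdhocAssessedTopic]:
    """Group qrels lines (qid, iteration, docno, relevance) by topic."""
    by_qid: Dict[str, List[AdhocAssessment]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, docno, rel = fields
        by_qid.setdefault(qid, []).append(AdhocAssessment(docno, float(rel)))
    # Dictionaries keep insertion order, so topics appear in first-seen order
    return [AdhocAssessedTopic(qid, items) for qid, items in by_qid.items()]


qrels = parse_qrels(["q1 0 d1 1", "q1 0 d2 0", "q2 0 d3 2"])
```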

Runs

XPM Config datamaestro_text.data.ir.AdhocRun(*, id)

IR ad-hoc run

id: str: The unique dataset ID

XPM Config datamaestro_text.data.ir.csv.AdhocRunWithText(*, id, separator, path)

(qid, doc.id, query, passage)

id: str: The unique dataset ID

separator: str

path: Path

Results

XPM Config datamaestro_text.data.ir.trec.TrecAdhocResults(*, id, metrics, results, detailed)

Ad-hoc results (TREC format)

id: str: The unique dataset ID

metrics: List[datamaestro_text.data.ir.Measure]: List of reported metrics

results: Path: Main results

detailed: Path: Results per topic (if any)

get_results() → Dict[str, float]: Returns the results as a dictionary {metric_name: value}
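
The {metric_name: value} dictionary can be pictured with a small sketch. It assumes the results file uses the three-column layout printed by trec_eval (measure, topic or "all", value); this page does not state the exact file layout, and `read_results` is a hypothetical helper, not the library's `get_results`.

```python
from typing import Dict, Iterable


def read_results(lines: Iterable[str]) -> Dict[str, float]:
    """Collect aggregate metrics from trec_eval-style lines (an assumed format)."""
    results: Dict[str, float] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        metric, topic, value = fields
        if topic != "all":
            continue  # keep only aggregate values, not per-topic ones
        try:
            results[metric] = float(value)
        except ValueError:
            continue  # e.g. a "runid" line carries a string, not a number
    return results


sample = [
    "runid\tall\tmyrun",
    "map\tall\t0.2500",
    "P_10\tall\t0.4000",
    "map\tq1\t0.5000",
]
results = read_results(sample)
```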

Reranking

XPM Config datamaestro_text.data.ir.RerankAdhoc(*, id, documents, topics, assessments, run)

Re-ranking ad-hoc task based on an existing run

id: str: The unique dataset ID

documents: datamaestro_text.data.ir.AdhocDocuments: The set of documents

topics: datamaestro_text.data.ir.AdhocTopics: The set of topics

assessments: datamaestro_text.data.ir.AdhocAssessments: The set of assessments (for each topic)

run: datamaestro_text.data.ir.AdhocRun: The run to re-rank
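
As an illustration of what a re-ranking task involves, the sketch below re-orders the documents of a first-stage run with a scoring function. Everything in it is an assumption for illustration: the run is modelled as a plain dictionary from query ID to document IDs, and `rerank` and `overlap` are hypothetical helpers rather than part of the library.

```python
from typing import Callable, Dict, List


def rerank(
    run: Dict[str, List[str]],
    queries: Dict[str, str],
    documents: Dict[str, str],
    score: Callable[[str, str], float],
) -> Dict[str, List[str]]:
    """Re-order each query's candidate documents by decreasing score."""
    reranked: Dict[str, List[str]] = {}
    for qid, docids in run.items():
        query = queries[qid]
        scored = [(score(query, documents[docid]), docid) for docid in docids]
        # The sort is stable, so tied documents keep their first-stage order
        scored.sort(key=lambda pair: -pair[0])
        reranked[qid] = [docid for _, docid in scored]
    return reranked


def overlap(query: str, text: str) -> float:
    """Toy scorer: number of distinct terms shared by query and document."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


run = {"q1": ["d1", "d2", "d3"]}
queries = {"q1": "neural ranking models"}
documents = {
    "d1": "classic boolean retrieval",
    "d2": "neural ranking with transformers",
    "d3": "ranking models",
}
new_run = rerank(run, queries, documents, overlap)
```

Only documents already present in the run are scored, which is what distinguishes re-ranking from retrieval over the full collection.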

Document Index

XPM Config datamaestro_text.data.ir.AdhocDocumentStore(*, id, count)

A document store

A document store can match external and internal IDs, return the document content, and return the number of documents.

id: str: The unique dataset ID

count: int: Number of documents

docid_internal2external(docid: int): Converts an internal collection ID (integer) to an external ID

document(internal_docid: int) → AdhocDocument: Returns a document given its internal ID

document_text(docid: str) → str: Returns the text of the document given its ID

property documentcount: Returns the number of documents in the store

iter_sample(randint: Optional[Callable[[int], int]]) → Iterator[AdhocDocument]: Sample documents from the dataset

XPM Config datamaestro_text.data.ir.AdhocIndex(*, id, count)

An index can be used to retrieve documents based on terms

id: str: The unique dataset ID

count: int: Number of documents

term_df(term: str): Returns the document frequency

property termcount: Returns the number of terms in the index
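
One common use of document frequencies is computing an inverse document frequency (IDF). The sketch below is a toy in-memory stand-in, not the library's index: `ToyIndex` only imitates the documented `count`, `term_df`, and `termcount` members, and the smoothed formula log((N + 1) / (df + 1)) is one variant among several, chosen here as an assumption.

```python
import math
from typing import Dict


class ToyIndex:
    """In-memory stand-in exposing count, term_df and termcount."""

    def __init__(self, documents: Dict[str, str]):
        self.count = len(documents)  # number of documents
        self._df: Dict[str, int] = {}
        for text in documents.values():
            # Count each term at most once per document
            for term in set(text.split()):
                self._df[term] = self._df.get(term, 0) + 1

    def term_df(self, term: str) -> int:
        """Number of documents containing the term (0 if unseen)."""
        return self._df.get(term, 0)

    @property
    def termcount(self) -> int:
        """Number of distinct terms in the index."""
        return len(self._df)


def idf(index: ToyIndex, term: str) -> float:
    """Smoothed inverse document frequency; defined even for unseen terms."""
    return math.log((index.count + 1) / (index.term_df(term) + 1))


index = ToyIndex(
    {"d1": "neural ranking", "d2": "ranking models", "d3": "boolean retrieval"}
)
common = idf(index, "ranking")
rare = idf(index, "boolean")
```

Terms that occur in fewer documents receive a larger IDF, so `rare` is greater than `common` in this example.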

Training triplets

XPM Config datamaestro_text.data.ir.TrainingTriplets(*, id, ids)

Triplet for training IR systems: query / query ID, positive document, negative document

id: str: The unique dataset ID

ids: bool

XPM Config datamaestro_text.data.ir.PairwiseSampleDataset(*, id, ids)

Datasets where each record is a query with positive and negative samples

id: str: The unique dataset ID

ids: bool: Whether data are texts or IDs

XPM Config datamaestro_text.data.ir.TrainingTripletsLines(*, id, ids, sep, path)

Training triplets with one line per triplet (text only)

id: str: The unique dataset ID

ids: bool

sep: str

path: Path

XPM Config datamaestro_text.data.ir.csv.TrainingTriplets(*, id, path, separator)

Training triplets (full text)

id: str: The unique dataset ID

ids: bool = True (constant)

path: Path

separator: str
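
The line-based triplet formats above can be read with a few lines of code. This is a minimal sketch under stated assumptions: each line holds exactly three fields in the order query, positive document, negative document, joined by a tab; `Triplet` and `iter_triplets` are hypothetical names, not library classes.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Triplet(NamedTuple):
    """One training example: a query with a positive and a negative document."""

    query: str
    positive: str
    negative: str


def iter_triplets(stream: TextIO, separator: str = "\t") -> Iterator[Triplet]:
    """Yield one triplet per line, skipping lines without exactly three fields."""
    for line in stream:
        fields = line.rstrip("\n").split(separator)
        if len(fields) != 3:
            continue
        yield Triplet(*fields)


sample = io.StringIO(
    "what is bm25\tbm25 is a ranking function\tparis is in france\n"
    "malformed line\n"
)
triplets = list(iter_triplets(sample))
```

The same loop works whether the fields hold full text or IDs; in the ID case, the fields would be resolved against a document and topic collection afterwards.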

XPM Config datamaestro_text.data.ir.csv.TrainingTripletsID(*, id, sep, path, separator, documents, topics)

Training triplets (query/document IDs only)

id: str: The unique dataset ID

ids: bool = True (constant): Whether documents are IDs or full text

sep: str

path: Path

separator: str: Field separator

documents: datamaestro_text.data.ir.AdhocDocuments: The documents

topics: datamaestro_text.data.ir.AdhocTopics: The topics

XPM Config datamaestro_text.data.ir.huggingface.HuggingFacePairwiseSampleDataset(*, id, ids, repo_id, data_files, split, query_id, pos_id, neg_id)

Triplet for training IR systems: query / query ID, positive document, negative document

id: str: The unique dataset ID

ids: bool: True if the triplet is made of IDs, False otherwise

repo_id: str

data_files: str

split: str

query_id: str = qid: The name of the field containing the query ID

pos_id: str = pos: The name of the field containing the positive samples

neg_id: str = neg: The name of the field containing the negative samples