Datamaestro Text Datasets
This section lists the datasets available through the datamaestro-text plugin.
Datasets are organized by domain:
Information Retrieval Datasets - Information retrieval benchmark collections (MS MARCO, TREC, etc.)
IR-Datasets Integration - Integration with the ir-datasets library
Conversational IR Datasets - Conversational search and query reformulation
Text Datasets - Raw text corpora
Word Embeddings - Pre-trained word embeddings
Recommendation Datasets - Rating and recommendation datasets
To load a dataset in Python:
from datamaestro import prepare_dataset
# Load by dataset ID
dataset = prepare_dataset("com.microsoft.msmarco.passage")
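When several datasets are needed at once, it can help to prepare them in a loop and record which IDs failed instead of stopping at the first error. The following is a minimal sketch, not part of datamaestro: `prepare_many` and `fake_loader` are hypothetical names introduced for illustration, and the loader is passed in as a parameter so the example runs without any datasets installed.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def prepare_many(
    dataset_ids: Iterable[str],
    loader: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Call `loader` on each ID and return (loaded, failed) dictionaries.

    Hypothetical helper: `loader` is any callable that takes a dataset ID,
    for example datamaestro's `prepare_dataset`.
    """
    loaded: Dict[str, Any] = {}
    failed: Dict[str, Exception] = {}
    for dataset_id in dataset_ids:
        try:
            loaded[dataset_id] = loader(dataset_id)
        except Exception as exc:
            # Record the error and continue so one bad ID does not stop the batch.
            failed[dataset_id] = exc
    return loaded, failed


def fake_loader(dataset_id: str) -> Dict[str, str]:
    """Stand-in loader for demonstration only; it downloads nothing."""
    if dataset_id.startswith("missing."):
        raise KeyError(dataset_id)
    return {"id": dataset_id}


loaded, failed = prepare_many(
    ["com.microsoft.msmarco.passage", "missing.example"],
    fake_loader,
)
print(sorted(loaded))
print(sorted(failed))
```

With datamaestro and the datamaestro-text plugin installed, passing `prepare_dataset` in place of `fake_loader` would apply the same loop to real dataset IDs.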
To discover available datasets from the command line:
# List datasets matching "text"
datamaestro search text
# Search by keyword
datamaestro search "trec"