Datamaestro Text
datamaestro-text is a datamaestro plugin that provides access to text-related datasets for research in:
- Information Retrieval (IR): document collections, topics, relevance judgments, training triplets
- Natural Language Processing (NLP): text corpora, tagging datasets
- Conversational IR: query rewriting, conversational search datasets
- Word Embeddings: pre-trained word vectors (GloVe, etc.)
- Recommendation: rating datasets (MovieLens, IMDB)
Installation
Install from PyPI:
pip install datamaestro-text
For development:
git clone https://github.com/bpiwowar/datamaestro-text.git
cd datamaestro-text
pip install -e ".[dev]"
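To confirm that the install succeeded, the installed version can be read from the package metadata. The sketch below uses only the standard library. The helper name `installed_version` is hypothetical, and the snippet assumes the distribution is registered under the PyPI name `datamaestro-text` used above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("datamaestro-text")
    print(found if found is not None else "datamaestro-text is not installed")
```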
Quick Start
List available datasets:
# List all datasets in the text repository
datamaestro search text
# Search for specific datasets
datamaestro search "msmarco"
Load a dataset in Python:
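from datamaestro import prepare_dataset
# Assumed import path for the item classes; check the API reference
from datamaestro_text.data.ir.base import IDItem, TextItem

# Load the MS MARCO passage dataset
dataset = prepare_dataset("com.microsoft.msmarco.passage")

# Iterate over the documents and print each identifier and text
for doc in dataset.documents.iter_documents():
    print(doc[IDItem].id, doc[TextItem].text)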
The plugin also provides access to the ir-datasets library
through the irds namespace:
# Load via ir-datasets integration
dataset = prepare_dataset("irds.msmarco-passage")
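Document collections can be large, so it is often useful to look at only the first few records. The following is a minimal sketch, not part of the datamaestro-text API: `preview` is a hypothetical helper that works with any object exposing `iter_documents()` as in the loop above, and `FakeDocuments` is a stand-in so the snippet runs without downloading anything. Plain dictionaries keyed by strings replace the real records keyed by `IDItem` and `TextItem`.

```python
from itertools import islice
from typing import Any, Iterator, List


def preview(documents: Any, limit: int = 3) -> List[Any]:
    """Return at most `limit` records without consuming the whole collection."""
    return list(islice(documents.iter_documents(), limit))


class FakeDocuments:
    """Stand-in for a document collection (assumption, for illustration only)."""

    def __init__(self, size: int) -> None:
        self.size = size

    def iter_documents(self) -> Iterator[dict]:
        # Lazily yield one record at a time, like a streaming collection
        for i in range(self.size):
            yield {"id": f"doc-{i}", "text": f"text of document {i}"}


sample = preview(FakeDocuments(1000), limit=2)
for record in sample:
    print(record["id"], record["text"])
```

With a real dataset, the corresponding call would be `preview(dataset.documents)`, assuming `dataset.documents` exposes `iter_documents()` as in the Quick Start loop.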
Key Concepts
- Data Types
Schema classes that define the structure of datasets (e.g., Documents, Topics, Adhoc). See the Datamaestro Text API for the complete API reference.
- Dataset Configurations
Specific dataset definitions that implement data types with download URLs and processing logic. See Datamaestro Text Datasets for available datasets.
- Records and Items
Typed data containers using the experimaestro record system. Common items include IDItem (identifiers) and TextItem (text content).
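The idea of a record that holds typed items and is indexed by item class can be illustrated with a toy model. This is a sketch of the pattern only, not the library's implementation: `ToyRecord`, `ToyIDItem`, and `ToyTextItem` are hypothetical names, deliberately distinct from the real `IDItem` and `TextItem`.

```python
from dataclasses import dataclass
from typing import Dict, Type, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class ToyIDItem:
    """Hypothetical item carrying an identifier."""
    id: str


@dataclass(frozen=True)
class ToyTextItem:
    """Hypothetical item carrying text content."""
    text: str


class ToyRecord:
    """Container holding at most one item per item class."""

    def __init__(self, *items: object) -> None:
        self._items: Dict[type, object] = {type(item): item for item in items}

    def __getitem__(self, item_type: Type[T]) -> T:
        # Lookup by class, in the style of doc[IDItem]
        return self._items[item_type]  # type: ignore[return-value]

    def has(self, item_type: type) -> bool:
        return item_type in self._items


doc = ToyRecord(ToyIDItem("D1"), ToyTextItem("hello world"))
print(doc[ToyIDItem].id, doc[ToyTextItem].text)
```

Indexing by class rather than by string key means a consumer can state exactly which items it needs, and a record without the requested item fails at the lookup instead of silently returning the wrong field.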