Getting Started
This guide shows how to use datamaestro-text to access datasets for your research.
Loading Datasets
All datasets are loaded using the prepare_dataset function:
from datamaestro import prepare_dataset
# Load by dataset ID
dataset = prepare_dataset("com.microsoft.msmarco.passage")
Dataset IDs follow a hierarchical naming convention based on their source
(e.g., com.microsoft.msmarco.passage for MS MARCO passage ranking).
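As a small illustration of the naming convention, the sketch below splits an ID into its hierarchy levels. The helper `split_dataset_id` is hypothetical and not part of datamaestro; it only assumes that IDs are dot-separated, as in the example above.

```python
def split_dataset_id(dataset_id: str) -> list[str]:
    """Split a dot-separated dataset ID into its hierarchy levels.

    Hypothetical helper for illustration; not part of datamaestro.
    """
    return dataset_id.split(".")


parts = split_dataset_id("com.microsoft.msmarco.passage")
print(parts)  # ['com', 'microsoft', 'msmarco', 'passage']
```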
Working with IR Datasets
Information Retrieval datasets typically contain three components:
Documents - The collection of documents to search
Topics - Queries or information needs
Assessments - Relevance judgments (qrels)
Example with MS MARCO:
from datamaestro import prepare_dataset
from datamaestro.record import IDItem, TextItem
# Load the dataset
adhoc = prepare_dataset("com.microsoft.msmarco.passage")
# Iterate over documents
for doc in adhoc.documents.iter_documents():
    doc_id = doc[IDItem].id
    doc_text = doc[TextItem].text
    print(f"Document {doc_id}: {doc_text[:100]}...")
# Iterate over topics (queries)
for topic in adhoc.topics.iter():
    topic_id = topic[IDItem].id
    query_text = topic[TextItem].text
    print(f"Query {topic_id}: {query_text}")
# Access relevance judgments
for assessed_topic in adhoc.assessments.iter():
    topic_id = assessed_topic.topic_id
    for assessment in assessed_topic.assessments:
        doc_id = assessment.doc_id
        relevance = assessment.rel
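
Relevance judgments are often easier to use as a nested dictionary. The sketch below builds one from objects exposing the attributes used in the loop above (`topic_id`, `assessments`, `doc_id`, `rel`). The two dataclasses and the `build_qrels` helper are stand-ins defined here so the example runs without downloading anything; they are not datamaestro classes.

```python
from dataclasses import dataclass, field


# Stand-in types for illustration only (not datamaestro classes);
# they expose the same attribute names as the loop above.
@dataclass
class Assessment:
    doc_id: str
    rel: int


@dataclass
class AssessedTopic:
    topic_id: str
    assessments: list = field(default_factory=list)


def build_qrels(assessed_topics):
    """Return {topic_id: {doc_id: relevance}} from assessed topics."""
    qrels = {}
    for assessed_topic in assessed_topics:
        judged = qrels.setdefault(assessed_topic.topic_id, {})
        for assessment in assessed_topic.assessments:
            judged[assessment.doc_id] = assessment.rel
    return qrels


# Toy data; with a real dataset, pass adhoc.assessments.iter() instead
toy = [
    AssessedTopic("q1", [Assessment("d1", 1), Assessment("d2", 0)]),
    AssessedTopic("q2", [Assessment("d3", 2)]),
]
qrels = build_qrels(toy)
print(qrels["q1"]["d1"])  # 1
```

Keeping the mapping keyed by topic first matches how judgments are usually consulted: per query, then per document.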
Using IR-Datasets Integration
The plugin exposes the ir-datasets library through the irds namespace,
giving access to hundreds of IR datasets:
from datamaestro import prepare_dataset
from datamaestro.record import IDItem
# Load via ir-datasets
dataset = prepare_dataset("irds.msmarco-passage")
# Same interface as native datasets
for doc in dataset.documents.iter_documents():
    print(doc[IDItem].id)
See IR-Datasets Integration for the full list of available ir-datasets.
Training Data for Neural IR
For training neural ranking models, use training triplets:
from datamaestro import prepare_dataset
from datamaestro.record import TextItem
# Load training triplets
triplets = prepare_dataset("com.microsoft.msmarco.passage.train.idstriples.small")
# Iterate over (query, positive, negative) triplets
for triplet in triplets.iter():
    query = triplet.query[TextItem].text
    positive_doc = triplet.positive[TextItem].text
    negative_doc = triplet.negative[TextItem].text
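
Training loops usually consume fixed-size batches rather than single triplets. The sketch below groups any iterable into batches with `itertools.islice`, so the full triplet file never has to be held in memory. The `batched` helper and the toy tuples are illustrative assumptions, not part of datamaestro.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Toy (query, positive, negative) texts standing in for real triplets
toy_triplets = [(f"query {i}", f"positive {i}", f"negative {i}") for i in range(5)]
batches = list(batched(toy_triplets, 2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```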
Word Embeddings
Load pre-trained word embeddings:
from datamaestro import prepare_dataset
# Load GloVe embeddings
glove = prepare_dataset("edu.stanford.glove.6b.50")
# Load word vectors
words, vectors = glove.load()
# vectors is a numpy matrix where vectors[i] is the embedding for words[i]
print(f"Vocabulary size: {len(words)}")
print(f"Embedding dimension: {vectors.shape[1]}")
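
Because vectors[i] corresponds to words[i], a word-to-index dictionary avoids scanning the vocabulary on every lookup. The sketch below uses toy data in place of `glove.load()` and only the standard library, so each row can be a plain list or a numpy row. The names `word_to_index` and `cosine` are illustrative, not part of datamaestro.

```python
import math

# Toy data standing in for `words, vectors = glove.load()`
words = ["king", "queen", "apple"]
vectors = [
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
]

# Map each word to its row index so lookups do not scan the vocabulary
word_to_index = {word: i for i, word in enumerate(words)}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


similarity = cosine(vectors[word_to_index["king"]], vectors[word_to_index["queen"]])
print(f"{similarity:.3f}")  # 0.707
```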
Dataset Discovery
Find available datasets from the command line:
# List all text datasets
datamaestro search text
# Search by keyword
datamaestro search "trec"
datamaestro search "conversation"
# Show dataset details
datamaestro info com.microsoft.msmarco.passage
Or programmatically:
from datamaestro import Repository
# Get the text repository
repo = Repository.find("text")
# List all dataset IDs
for dataset_id in repo.datasetids():
    print(dataset_id)
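
Since IDs are hierarchical, a listing can be narrowed to one source by matching a dot-separated prefix. The sketch below works on a plain list of ID strings taken from this guide; `ids_with_prefix` is a hypothetical helper, and with a real repository the IDs would first need converting to strings (an assumption about what the listing above yields).

```python
def ids_with_prefix(dataset_ids, prefix):
    """Keep IDs equal to prefix or nested below it in the hierarchy."""
    return [
        dataset_id
        for dataset_id in dataset_ids
        if dataset_id == prefix or dataset_id.startswith(prefix + ".")
    ]


# IDs that appear elsewhere in this guide
known_ids = [
    "com.microsoft.msmarco.passage",
    "com.microsoft.msmarco.passage.train.idstriples.small",
    "edu.stanford.glove.6b.50",
    "irds.msmarco-passage",
]
msmarco_ids = ids_with_prefix(known_ids, "com.microsoft.msmarco")
print(len(msmarco_ids))  # 2
```

Matching on `prefix + "."` rather than a bare `startswith(prefix)` keeps an ID such as `com.microsoft.msmarcoX` from being picked up by accident.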
User Agreements
Some datasets require accepting a user agreement before download. When you first access such a dataset, datamaestro will prompt you to accept the terms.
You can accept an agreement ahead of time by preparing the dataset from the command line:
datamaestro prepare com.microsoft.msmarco.passage
Caching and Data Directory
Downloaded data is cached in ~/.local/share/datamaestro by default.
You can change this by setting the DATAMAESTRO_DATA environment variable:
export DATAMAESTRO_DATA=/path/to/data
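
To check which directory a script would use, the sketch below mirrors the rule described above: the DATAMAESTRO_DATA environment variable if it is set, otherwise the default path. `resolve_data_dir` is a hypothetical helper, not datamaestro's own resolution logic, which may take other settings into account.

```python
import os
from pathlib import Path

DEFAULT_DATA_DIR = "~/.local/share/datamaestro"  # default stated above


def resolve_data_dir(environ=None):
    """Return the data directory implied by DATAMAESTRO_DATA, or the default."""
    environ = os.environ if environ is None else environ
    return Path(environ.get("DATAMAESTRO_DATA", DEFAULT_DATA_DIR)).expanduser()


# Explicit mappings are passed here so the example does not depend on the shell
custom_dir = resolve_data_dir({"DATAMAESTRO_DATA": "/path/to/data"})
default_dir = resolve_data_dir({})
print(custom_dir)
print(default_dir)
```

Calling `resolve_data_dir()` with no argument reads the real process environment.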