Text API
This module provides data types for raw text datasets.
These types are used for datasets that provide text content without additional structure (as opposed to IR datasets which have documents, topics, etc.).
Text Storage
Basic containers for text data:
- XPM Config datamaestro_text.data.text.TextFolder(*, id, path)
Bases: Folder
A folder composed of texts
- id: str
The unique (sub-)dataset ID
- path: Path
A folder containing text files. Access the path via the path attribute.
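As a rough illustration of working with such a folder, the sketch below walks a directory and yields the content of each text file. It uses only the standard library and takes a plain `Path`, so it assumes nothing about the dataset beyond the `path` attribute described above. The helper name `iter_texts`, the `*.txt` pattern, and the UTF-8 encoding are assumptions made for this example, not part of datamaestro_text.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_texts(folder: Path, pattern: str = "*.txt") -> Iterator[Tuple[str, str]]:
    """Yield (file name, content) for each matching file directly in `folder`.

    Hypothetical helper: the glob pattern and the UTF-8 encoding are
    assumptions; adjust them to the dataset at hand.
    """
    # Sort so that the iteration order is stable across runs and platforms
    for file_path in sorted(folder.glob(pattern)):
        if file_path.is_file():
            yield file_path.name, file_path.read_text(encoding="utf-8")
```

With a prepared TextFolder dataset bound to a variable such as `dataset`, the call would presumably look like `iter_texts(dataset.path)`.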
- XPM Config datamaestro_text.data.text.TextFile(*, id, path)
Bases: File
A file composed of texts
- id: str
The unique (sub-)dataset ID
- path: Path
The path of the file
A single file containing text content. Access the path via the path attribute.
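For a single text file, one plausible access pattern is to stream it line by line rather than load it whole. The sketch below does this with the standard library only. The helper name `iter_lines`, the UTF-8 default, and the choice to skip empty lines are assumptions made for this example, not behavior defined by datamaestro_text.

```python
from pathlib import Path
from typing import Iterator


def iter_lines(path: Path, encoding: str = "utf-8") -> Iterator[str]:
    """Yield the non-empty lines of a text file lazily.

    Hypothetical helper: the encoding and the decision to drop empty
    lines are assumptions for illustration.
    """
    with path.open("r", encoding=encoding) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line
```

With a prepared TextFile dataset bound to a variable such as `dataset`, the call would presumably be `iter_lines(dataset.path)`.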
Training Datasets
For machine learning tasks with train/test splits:
- XPM Config datamaestro_text.data.text.TrainingText(*, id, train, validation, test)
Bases: Supervised
A dataset used for training, with a train and a test split
- id: str
The unique (sub-)dataset ID
- train: datamaestro.data.Base
- validation: datamaestro.data.Base
- test: datamaestro.data.Base
A supervised learning dataset with a train split and, optionally, validation and test splits. As the example below notes, test and validation may be None.
Example usage:
from datamaestro import prepare_dataset
# Load a text training dataset
dataset = prepare_dataset("org.allenai.bookcorpus")
# Access the training data
train_data = dataset.train
test_data = dataset.test # may be None
validation_data = dataset.validation # may be None
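Because the example above marks test and validation as possibly None, code that consumes the dataset may want to collect only the splits that are actually set. The sketch below shows one way to do that. The helper `available_splits` is hypothetical, and the `SimpleNamespace` object is a stand-in used purely so the snippet runs without downloading a dataset; only the attribute names train, validation, and test come from the documentation above.

```python
from types import SimpleNamespace
from typing import Any, Dict


def available_splits(dataset: Any) -> Dict[str, Any]:
    """Return the splits that are set on `dataset`, keyed by split name.

    Hypothetical helper: it only relies on the attribute names
    train / validation / test and skips any that are missing or None.
    """
    splits: Dict[str, Any] = {}
    for name in ("train", "validation", "test"):
        value = getattr(dataset, name, None)
        if value is not None:
            splits[name] = value
    return splits


# Stand-in object for illustration only; a real dataset would come
# from prepare_dataset(...) as in the example above.
demo = SimpleNamespace(train="train-data", validation=None, test="test-data")

for split_name, split_data in available_splits(demo).items():
    print(split_name, split_data)
```

Iterating over the returned dictionary avoids scattering `is not None` checks through downstream code.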