Text API

This module provides data types for raw text datasets.

These types are used for datasets that provide text content without additional structure (as opposed to IR datasets which have documents, topics, etc.).

Text Storage

Basic containers for text data:

XPM Config datamaestro_text.data.text.TextFolder(*, id, path)

Bases: Folder

A folder composed of texts

id: str

The unique (sub-)dataset ID

path: Path

A folder containing text files. Access the path via the path attribute.
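As a rough illustration of how the path attribute might be consumed, the sketch below walks a folder and reads each text file in it. The helper name iter_text_files, the "*.txt" pattern, and the UTF-8 encoding are assumptions made for this example and are not part of datamaestro_text; the actual layout of a given dataset may differ.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_text_files(folder: Path, pattern: str = "*.txt") -> Iterator[Tuple[str, str]]:
    """Yield (file name, content) for each matching file, in sorted order.

    Hypothetical helper: the glob pattern and the UTF-8 encoding are
    assumptions, not something the TextFolder type guarantees.
    """
    for file_path in sorted(Path(folder).glob(pattern)):
        # Skip directories that happen to match the pattern
        if file_path.is_file():
            yield file_path.name, file_path.read_text(encoding="utf-8")
```

With a prepared TextFolder dataset, the argument would presumably be its path attribute, for example iter_text_files(dataset.path).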

XPM Config datamaestro_text.data.text.TextFile(*, id, path)

Bases: File

A file composed of texts

id: str

The unique (sub-)dataset ID

path: Path

The path of the file

A single file containing text content. Access the path via the path attribute.
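For a single file, a minimal sketch of streaming its content line by line could look like the following. The helper name iter_lines is hypothetical, and the one-text-per-line layout and UTF-8 encoding are assumptions that may not hold for every dataset.

```python
from pathlib import Path
from typing import Iterator


def iter_lines(path: Path, encoding: str = "utf-8") -> Iterator[str]:
    """Yield non-empty lines lazily, without their trailing newline.

    Hypothetical helper: the encoding and the one-text-per-line layout
    are assumptions made for this example.
    """
    with Path(path).open("r", encoding=encoding) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line
```

Reading lazily avoids loading a large corpus file into memory at once; with a prepared TextFile dataset, the argument would presumably be its path attribute.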

Training Datasets

For machine learning tasks with train/test splits:

XPM Config datamaestro_text.data.text.TrainingText(*, id, train, validation, test)

Bases: Supervised

A dataset used for training with a train and a test split

id: str

The unique (sub-)dataset ID

train: datamaestro.data.Base
validation: datamaestro.data.Base
test: datamaestro.data.Base

A supervised learning dataset with a train split and validation and test splits that may be absent (None).

Example usage:

from datamaestro import prepare_dataset

# Load a text training dataset
dataset = prepare_dataset("org.allenai.bookcorpus")

# Access the training data
train_data = dataset.train
test_data = dataset.test  # may be None
validation_data = dataset.validation  # may be None
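Because some splits may be None, calling code may want to collect only the splits that are actually present. The sketch below shows one way to do that; available_splits is a hypothetical helper, and the SimpleNamespace object is only a stand-in for a dataset returned by prepare_dataset.

```python
from types import SimpleNamespace
from typing import Any, Dict


def available_splits(dataset: Any) -> Dict[str, Any]:
    """Return the splits that are set on the dataset, keyed by split name."""
    splits = {}
    for name in ("train", "validation", "test"):
        # A split that is missing or None is left out of the result
        split = getattr(dataset, name, None)
        if split is not None:
            splits[name] = split
    return splits


# Stand-in object for illustration only; a real dataset would come from prepare_dataset
demo = SimpleNamespace(train="train-split", validation=None, test="test-split")
for name, split in available_splits(demo).items():
    print(name, split)
```

Iterating over the returned dictionary avoids repeating None checks at every place a split is used.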