Text API

This module provides data types for raw text datasets.

These types are used for datasets that provide text content without additional structure, as opposed to information retrieval (IR) datasets, which have documents, topics, etc.

Text Storage

Basic containers for text data:

XPM Config datamaestro_text.data.text.TextFolder(*, id, path)

Bases: Folder

A folder composed of texts

id: str: The unique (sub-)dataset ID

path: path

A folder containing text files. Access the path via the path attribute.
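As a minimal sketch of how the folder path might be consumed, the helper below walks a directory and yields the content of each text file. The function name `iter_text_files`, the `*.txt` pattern, and the UTF-8 encoding are illustrative assumptions, not part of datamaestro_text; only the `path` attribute comes from the class description above.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_text_files(folder: Path, pattern: str = "*.txt") -> Iterator[Tuple[str, str]]:
    """Yield (file name, content) pairs for matching files, sorted by name.

    Hypothetical helper: the glob pattern and the UTF-8 encoding are
    assumptions about how the files in the folder are stored.
    """
    for file_path in sorted(folder.glob(pattern)):
        if file_path.is_file():
            yield file_path.name, file_path.read_text(encoding="utf-8")
```

Given a TextFolder instance held in a variable such as `folder_data` (a hypothetical name), the call would look like `iter_text_files(Path(folder_data.path))`.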

XPM Config datamaestro_text.data.text.TextFile(*, id, path)

Bases: File

A file composed of texts

id: str: The unique (sub-)dataset ID

path: path: The path of the file

A single file containing text content. Access the path via the path attribute.
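One way to read such a file without loading it entirely into memory is to stream it line by line. The sketch below assumes a UTF-8 file with one text per line; the helper name `iter_lines` and that layout are hypothetical and may not match a given dataset.

```python
from pathlib import Path
from typing import Iterator


def iter_lines(file_path: Path) -> Iterator[str]:
    """Lazily yield the non-empty lines of a text file.

    Hypothetical helper: assumes UTF-8 and a one-text-per-line layout.
    """
    with file_path.open("rt", encoding="utf-8") as fp:
        for line in fp:
            line = line.rstrip("\n")
            if line:  # skip blank lines
                yield line
```

With a TextFile instance in a variable such as `file_data` (a hypothetical name), this would be called as `iter_lines(Path(file_data.path))`.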

Training Datasets

For machine learning tasks with train/test splits:

XPM Config datamaestro_text.data.text.TrainingText(*, id, train, validation, test)

Bases: Supervised

A dataset used for training, with a train and a test split

id: str: The unique (sub-)dataset ID

train: datamaestro.data.Base

validation: datamaestro.data.Base

test: datamaestro.data.Base

A supervised learning dataset with a train split and validation and test splits that may be absent (None).

Example usage:

from datamaestro import prepare_dataset

# Load a text training dataset
dataset = prepare_dataset("org.allenai.bookcorpus")

# Access the training data
train_data = dataset.train
test_data = dataset.test  # may be None
validation_data = dataset.validation  # may be None
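Because the validation and test splits may be None, calling code usually needs to check which splits are present before using them. The following is a small sketch of that check; `available_splits` is a hypothetical helper, not a datamaestro function, and it relies only on the `train`, `validation`, and `test` attribute names listed above.

```python
from typing import Any, Dict


def available_splits(dataset: Any) -> Dict[str, Any]:
    """Return a mapping of split name to split, skipping splits that are None.

    Hypothetical helper: works on any object exposing train, validation
    and test attributes.
    """
    splits: Dict[str, Any] = {}
    for name in ("train", "validation", "test"):
        value = getattr(dataset, name, None)
        if value is not None:
            splits[name] = value
    return splits
```

Applied to the `dataset` object from the example, `available_splits(dataset)` would contain only the splits that the dataset actually defines, so a loop over its items never touches a missing split.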