Text Datasets

This section lists datasets consisting of raw text corpora, useful for language model pre-training and other NLP tasks.

BookCorpus

The BookCorpus dataset contains text from thousands of free books from Smashwords. It has been used to pre-train language models such as GPT and BERT.

Dataset com.smashwords.bookcorpus

→ datamaestro_text.data.text.TextFolder

Unpublished books from Smashwords

Tags: English, books, text

Tasks: language modeling

The books are concatenated into two files hosted on the Hugging Face NLP storage. Each sentence is on a separate line, and tokens are space-separated.

Example usage:

from datamaestro import prepare_dataset

# Download the dataset if needed and return its definition
dataset = prepare_dataset("com.smashwords.bookcorpus")
# Path of the folder containing the text files
text_folder = dataset.path
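
Because each sentence is on its own line and tokens are separated by spaces, the corpus can be read one sentence at a time without extra tokenization. The following is a minimal sketch, not part of datamaestro: the helper name iter_sentences is hypothetical, and it assumes the text files sit directly in the folder as uncompressed UTF-8 text.

```python
from pathlib import Path
from typing import Iterator, List


def iter_sentences(folder: Path) -> Iterator[List[str]]:
    """Yield each sentence found in the folder as a list of tokens.

    Hypothetical helper: assumes uncompressed UTF-8 text files located
    directly in `folder`, one sentence per line, space-separated tokens.
    """
    # Sort so the files are always read in the same order
    for file_path in sorted(p for p in folder.iterdir() if p.is_file()):
        with file_path.open("rt", encoding="utf-8") as fp:
            for line in fp:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens
```

Under these assumptions, iterating with `for tokens in iter_sentences(Path(text_folder))` streams the corpus lazily, so the two files never need to be loaded into memory at once.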