Text Datasets

This section lists datasets consisting of raw text corpora, useful for language model pre-training and other NLP tasks.

BookCorpus

The BookCorpus dataset contains text from thousands of free books from Smashwords. It has been widely used to pre-train language models such as GPT and BERT.

Dataset com.smashwords.bookcorpus

datamaestro_text.data.text.TextFolder

Unpublished books from Smashwords

Tags: English, books, text

Tasks: language modeling

External link: https://yknzhu.wixsite.com/mbweb

The books are concatenated into two files hosted on Hugging Face NLP storage. Each sentence is on a separate line, and tokens are space-separated.

Example usage:

from datamaestro import prepare_dataset

dataset = prepare_dataset("com.smashwords.bookcorpus")
# Access the text folder
text_folder = dataset.path
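
Because each line holds one sentence with space-separated tokens, the files can be read without a tokenizer. The following is a minimal sketch, not part of datamaestro: `iter_sentences` is a hypothetical helper, and it assumes the folder directly contains uncompressed, UTF-8 encoded text files.

```python
from pathlib import Path
from typing import Iterator, List


def iter_sentences(folder: Path) -> Iterator[List[str]]:
    """Yield one token list per non-empty line of every file in `folder`.

    Hypothetical helper: assumes plain, UTF-8 encoded text files located
    directly inside `folder` (no sub-directories, no compression).
    """
    # Sort the files so that the iteration order is reproducible
    for file_path in sorted(p for p in folder.iterdir() if p.is_file()):
        with file_path.open("rt", encoding="utf-8") as fp:
            for line in fp:
                # str.split() with no argument splits on any whitespace run,
                # which covers the space-separated tokens and drops the newline
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens
```

With the folder obtained above, this could be called as `iter_sentences(Path(text_folder))`. The generator reads the files lazily, so the whole corpus never has to fit in memory.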