Text Datasets
This section lists datasets consisting of raw text corpora, useful for language model pre-training and other NLP tasks.
BookCorpus
The BookCorpus dataset contains text from thousands of free books from Smashwords. It was introduced by Zhu et al. (2015) and later used to pre-train models such as GPT and BERT.
Dataset com.smashwords.bookcorpus
datamaestro_text.data.text.TextFolder
Unpublished books from Smashwords
Tags: English, books, text
Tasks: language modeling
External link: https://yknzhu.wixsite.com/mbweb
The books are concatenated into two files hosted on Hugging Face NLP storage. Each sentence is on a separate line, and tokens are separated by spaces.
Example usage:
from datamaestro import prepare_dataset
dataset = prepare_dataset("com.smashwords.bookcorpus")
# Access the text folder
text_folder = dataset.path
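
Because each sentence sits on its own line with space-separated tokens, the files can be read without a tokenizer. The following is a minimal sketch, not part of datamaestro: the helper names `iter_tokenized_sentences` and `iter_folder` are hypothetical, and the code assumes the folder contains only UTF-8 text files.

```python
from pathlib import Path
from typing import Iterator, List


def iter_tokenized_sentences(path: Path) -> Iterator[List[str]]:
    """Yield one list of tokens per non-empty line of a pre-tokenized file."""
    # Assumption: the file is UTF-8 encoded.
    with path.open(encoding="utf-8") as lines:
        for line in lines:
            tokens = line.split()  # tokens are separated by whitespace
            if tokens:  # skip blank lines
                yield tokens


def iter_folder(folder: Path) -> Iterator[List[str]]:
    """Yield tokenized sentences from every regular file in the folder."""
    # Sorting makes the iteration order reproducible across runs.
    for file_path in sorted(p for p in folder.iterdir() if p.is_file()):
        yield from iter_tokenized_sentences(file_path)
```

Assuming `dataset.path` points at the folder that holds the two text files, sentences could then be streamed with `for tokens in iter_folder(Path(dataset.path)): ...`. Using generators avoids loading the whole corpus into memory.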