Text datasets

This section lists datasets that consist of raw text.

Book Corpus

Dataset com.smashwords.bookcorpus

datamaestro_text.data.text.TextFolder

Unpublished books from Smashwords

Tags: text, English, books

Tasks: language modeling

External link: https://yknzhu.wixsite.com/mbweb

The books are concatenated into two files hosted on Hugging Face NLP storage. Each sentence is on its own line, and tokens are separated by spaces.
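
Given this layout (one sentence per line, space-separated tokens), the files can be read with plain Python. The following is a minimal sketch, not part of the datamaestro API: the function name `iter_sentences` and the `folder` argument are hypothetical, and it assumes the files have already been downloaded into a local folder and are UTF-8 encoded.

```python
from pathlib import Path
from typing import Iterator, List


def iter_sentences(folder: Path) -> Iterator[List[str]]:
    """Yield each sentence as a list of tokens.

    Assumes (hypothetically) that `folder` holds the downloaded text files,
    with one sentence per line and tokens separated by whitespace.
    """
    # Sort the files so that the iteration order is deterministic.
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        # Stream line by line: the corpus files are large.
        with path.open("r", encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                # Skip blank lines.
                if tokens:
                    yield tokens
```

Because the function is a generator, sentences are produced lazily and the corpus never has to fit in memory, which suits language modeling pipelines that consume the text in a single pass.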