Text datasets
This section lists datasets that are made of raw text.
Book Corpus
Dataset com.smashwords.bookcorpus
datamaestro_text.data.text.TextFolder
Unpublished books from Smashwords
Tags: books, English, text
Tasks: language modeling
External link: https://yknzhu.wixsite.com/mbweb
The books are concatenated into two files hosted on Hugging Face NLP storage. Each sentence is on a separate line, and tokens are space-separated.
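Given this layout (one sentence per line, tokens separated by spaces), a file can be read with a few lines of plain Python. The following is a minimal sketch, not part of the dataset's own API: the helper name `iter_sentences` is hypothetical, and the UTF-8 encoding is an assumption.

```python
from pathlib import Path
from typing import Iterator, List


def iter_sentences(path: Path) -> Iterator[List[str]]:
    """Yield each non-empty line of `path` as a list of tokens.

    Hypothetical helper: assumes a UTF-8 text file with one sentence
    per line and whitespace-separated tokens.
    """
    with path.open("rt", encoding="utf-8") as fp:
        for line in fp:
            # str.split() with no argument splits on runs of whitespace
            # and drops the trailing newline.
            tokens = line.split()
            if tokens:  # skip blank lines
                yield tokens
```

Because the function is a generator, sentences are read lazily, which matters for files too large to hold in memory. For example, `sum(len(s) for s in iter_sentences(path))` counts the tokens in a file without loading it whole.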