Text Datasets
=============

This section lists datasets consisting of raw text corpora, useful for language model pre-training and other NLP tasks.

BookCorpus
----------

The BookCorpus dataset contains text from thousands of free books from Smashwords. It was originally used for training models like GPT and BERT.

.. dm:datasets:: com.smashwords.bookcorpus

Example usage:

.. code-block:: python

    from datamaestro import prepare_dataset

    dataset = prepare_dataset("com.smashwords.bookcorpus")

    # Access the text folder
    text_folder = dataset.path
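
Once the text folder is available, a typical next step is to stream its contents line by line, for instance to feed a tokenizer. The following is a minimal sketch, not part of datamaestro: ``iter_text_lines`` is a hypothetical helper, and it assumes that ``dataset.path`` points to a directory containing UTF-8 plain-text files with a ``.txt`` extension. The actual file layout of the downloaded corpus may differ, so adjust the glob pattern as needed.

.. code-block:: python

    from pathlib import Path
    from typing import Iterator


    def iter_text_lines(folder: Path, pattern: str = "*.txt") -> Iterator[str]:
        """Yield non-empty, stripped lines from every matching file under folder.

        Hypothetical helper: the "*.txt" pattern is an assumption about how
        the corpus is stored on disk, not something datamaestro guarantees.
        """
        # Sort the paths so that iteration order is reproducible across runs
        for path in sorted(Path(folder).rglob(pattern)):
            # Replace undecodable bytes rather than failing on a stray character
            with path.open(encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    line = line.strip()
                    if line:
                        yield line

With the ``dataset`` object from the example above, this could be used as ``for line in iter_text_lines(dataset.path): ...``. Using a generator keeps memory usage flat, which matters for a corpus of this size.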