Word Embeddings

Pre-trained word embeddings for NLP tasks.

GloVe

GloVe (Global Vectors for Word Representation) embeddings from Stanford NLP. The 6B release is available in four dimensions (50, 100, 200, 300); the 42B and 840B releases are trained on larger Common Crawl corpora.

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
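
To make "global word-word co-occurrence statistics" concrete, here is a minimal sketch of how such counts could be collected from a tokenized corpus. The function name `cooccurrence_counts`, the `window` parameter, and the toy corpus are all hypothetical and for illustration only. The sketch uses plain unweighted counts, whereas GloVe weights each pair by the inverse of the distance between the two words and then fits vectors to the logarithm of the counts.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count ordered (word, context) pairs that appear within `window` tokens.

    Hypothetical helper: unweighted counts only, not GloVe's exact weighting.
    """
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own context
                    counts[(word, tokens[j])] += 1
    return counts


# Toy corpus, invented for this example
corpus = [
    "ice is cold and solid".split(),
    "steam is hot and gas".split(),
]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("is", "and")])   # pair seen in both sentences
print(counts[("ice", "and")])  # too far apart for window=2
```

Aggregating these counts over the whole corpus gives the matrix that training operates on, which is why the statistics are called global rather than per-context.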

Dataset edu.stanford.glove.6b

datamaestro_text.data.embeddings.WordEmbeddingsText

Embeddings trained on a 6B-token corpus (Wikipedia + Gigaword), in various dimensions

Dataset edu.stanford.glove.6b.50

datamaestro_text.data.embeddings.WordEmbeddingsText

GloVe 6B - dimension 50

Dataset edu.stanford.glove.6b.100

datamaestro_text.data.embeddings.WordEmbeddingsText

GloVe 6B - dimension 100

Dataset edu.stanford.glove.6b.200

datamaestro_text.data.embeddings.WordEmbeddingsText

GloVe 6B - dimension 200

Dataset edu.stanford.glove.6b.300

datamaestro_text.data.embeddings.WordEmbeddingsText

GloVe 6B - dimension 300

Dataset edu.stanford.glove.42b

datamaestro_text.data.embeddings.WordEmbeddingsText

GloVe embeddings trained on Common Crawl with 42B tokens

Dataset edu.stanford.glove.840b

datamaestro_text.data.embeddings.WordEmbeddingsText

GloVe embeddings trained on Common Crawl with 840B tokens

Example usage:

from datamaestro import prepare_dataset

# Load 100-dimensional GloVe trained on Wikipedia + Gigaword
glove = prepare_dataset("edu.stanford.glove.6b.100")

# Load embeddings into memory
words, vectors = glove.load()

# Create lookup dictionary
word_to_idx = {w: i for i, w in enumerate(words)}

# Get embedding for a word
idx = word_to_idx.get("example")
if idx is not None:
    embedding = vectors[idx]
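
A common next step after the lookup above is to compare two words by the cosine similarity of their vectors. The sketch below is one possible way to do that. The helper names `cosine_similarity` and `word_similarity` are hypothetical, and the tiny `toy_words` / `toy_vectors` lists are invented stand-ins for the `words` and `vectors` returned by `glove.load()`, so the example runs without downloading anything.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_similarity(a, b, word_to_idx, vectors):
    """Return the cosine similarity of two words, or None if either is out of vocabulary."""
    i, j = word_to_idx.get(a), word_to_idx.get(b)
    if i is None or j is None:
        return None
    return cosine_similarity(vectors[i], vectors[j])


# Toy stand-ins for the loaded embeddings (values invented for illustration)
toy_words = ["hot", "warm", "cold"]
toy_vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
toy_word_to_idx = {w: i for i, w in enumerate(toy_words)}

print(word_similarity("hot", "warm", toy_word_to_idx, toy_vectors))
print(word_similarity("hot", "unknown", toy_word_to_idx, toy_vectors))
```

The pure-Python loop keeps the sketch dependency-free. For nearest-neighbour search over a full vocabulary, a vectorized computation over the whole matrix would be a better fit.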