Word Embeddings
Pre-trained word embeddings for NLP tasks.
GloVe
GloVe (Global Vectors for Word Representation) embeddings from Stanford NLP. The 6B set is available in 50, 100, 200, and 300 dimensions; the 42B and 840B sets are trained on larger Common Crawl corpora.
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
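The "linear substructures" mentioned above are usually illustrated with analogies: the offset between two related words (for example man and woman) is roughly the same as the offset between another related pair (king and queen). The sketch below shows the arithmetic on a tiny hand-made table. The vectors in `toy` are invented for illustration and are not real GloVe values, and `cosine` and `analogy` are hypothetical helper names, not part of datamaestro.

```python
import numpy as np

# Hand-made 3-dimensional vectors, chosen so that the analogy holds exactly.
# Real GloVe vectors have 50 to 300 dimensions and only satisfy it approximately.
toy = {
    "man": np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king": np.array([3.0, 0.0, 0.9]),
    "queen": np.array([3.0, 1.0, 0.9]),
    "apple": np.array([0.0, 0.2, 3.0]),
}


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c, table):
    """Solve 'a is to b as c is to ?' by vector offset.

    Builds the target vector table[b] - table[a] + table[c] and returns the
    remaining word whose vector is closest to it by cosine similarity.
    """
    target = table[b] - table[a] + table[c]
    candidates = [w for w in table if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, table[w]))


result = analogy("man", "woman", "king", toy)
print(result)
```

With real embeddings the same function can be applied to a dictionary built from the loaded words and vectors, although the best match is then only approximately the expected word.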
-
Dataset edu.stanford.glove.6b
datamaestro_text.data.embeddings.WordEmbeddingsText
GloVe embeddings trained on Wikipedia + Gigaword (6B tokens), in various dimensions
-
Dataset edu.stanford.glove.6b.50
datamaestro_text.data.embeddings.WordEmbeddingsText
Glove 6B - dimension 50
-
Dataset edu.stanford.glove.6b.100
datamaestro_text.data.embeddings.WordEmbeddingsText
Glove 6B - dimension 100
-
Dataset edu.stanford.glove.6b.200
datamaestro_text.data.embeddings.WordEmbeddingsText
Glove 6B - dimension 200
-
Dataset edu.stanford.glove.6b.300
datamaestro_text.data.embeddings.WordEmbeddingsText
Glove 6B - dimension 300
-
Dataset edu.stanford.glove.42b
datamaestro_text.data.embeddings.WordEmbeddingsText
Glove embeddings trained on Common Crawl with 42B tokens
-
Dataset edu.stanford.glove.840b
datamaestro_text.data.embeddings.WordEmbeddingsText
Glove embeddings trained on Common Crawl with 840B tokens
Example usage:
from datamaestro import prepare_dataset
# Load 100-dimensional GloVe trained on Wikipedia + Gigaword
glove = prepare_dataset("edu.stanford.glove.6b.100")
# Load embeddings into memory
words, vectors = glove.load()
# Create lookup dictionary
word_to_idx = {w: i for i, w in enumerate(words)}
# Get embedding for a word
idx = word_to_idx.get("example")
if idx is not None:
    embedding = vectors[idx]
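Once the words and vectors are in memory, a common next step is a nearest-neighbour lookup by cosine similarity. The following is a minimal sketch, assuming `words` is a list of strings and `vectors` is a 2-D array-like with one row per word, as in the example above. `nearest_neighbors` is a hypothetical helper, not a datamaestro function, and the toy data stands in for real GloVe vectors so the snippet runs without a download.

```python
import numpy as np


def nearest_neighbors(words, vectors, query, k=3):
    """Return up to k (word, score) pairs most similar to `query` by cosine similarity."""
    matrix = np.asarray(vectors, dtype=np.float64)  # also accepts numpy.matrix
    word_to_idx = {w: i for i, w in enumerate(words)}
    idx = word_to_idx.get(query)
    if idx is None:
        return []  # query is out of vocabulary

    # Normalise rows so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = matrix / norms[:, None]

    scores = unit @ unit[idx]
    scores[idx] = -np.inf  # exclude the query word itself
    top = np.argsort(-scores)[: min(k, len(words) - 1)]
    return [(words[i], float(scores[i])) for i in top]


# Toy stand-in for the output of glove.load(); the values are invented.
toy_words = ["king", "queen", "apple", "pear"]
toy_vectors = np.array(
    [
        [1.0, 0.9, 0.0],
        [0.9, 1.0, 0.0],
        [0.0, 0.1, 1.0],
        [0.1, 0.0, 1.0],
    ]
)

neighbors = nearest_neighbors(toy_words, toy_vectors, "king", k=2)
print(neighbors)
```

For the full 6B vocabulary, normalising the matrix once and reusing it across queries avoids repeating the most expensive step on every lookup.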