Word Embeddings
Pre-trained word embeddings for NLP tasks.
GloVe
GloVe (Global Vectors for Word Representation) embeddings from Stanford NLP. The datasets below cover several corpora; the 6B release is available in dimensions 50, 100, 200 and 300.
edu.stanford.glove
- GloVe is an unsupervised learning algorithm for obtaining vector representations for words.
Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
-
Dataset edu.stanford.glove.6b
datamaestro_text.data.embeddings.WordEmbeddingsText
GloVe embeddings trained on a 6B-token corpus, in various dimensions
Tags: word embeddings
-
Dataset edu.stanford.glove.6b.50
datamaestro_text.data.embeddings.WordEmbeddingsText
GloVe 6B - dimension 50
Tags: word embeddings
-
Dataset edu.stanford.glove.6b.100
datamaestro_text.data.embeddings.WordEmbeddingsText
GloVe 6B - dimension 100
Tags: word embeddings
-
Dataset edu.stanford.glove.6b.200
datamaestro_text.data.embeddings.WordEmbeddingsText
GloVe 6B - dimension 200
Tags: word embeddings
-
Dataset edu.stanford.glove.6b.300
datamaestro_text.data.embeddings.WordEmbeddingsText
GloVe 6B - dimension 300
Tags: word embeddings
-
Dataset edu.stanford.glove.42b
datamaestro_text.data.embeddings.WordEmbeddingsText
GloVe embeddings trained on 42B tokens of Common Crawl
Tags: word embeddings
-
Dataset edu.stanford.glove.840b
datamaestro_text.data.embeddings.WordEmbeddingsText
GloVe embeddings trained on 840B tokens of Common Crawl
Tags: word embeddings
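All datasets above are typed as WordEmbeddingsText, i.e. they are stored as plain text. The sketch below illustrates how such a file can be read, assuming the layout used by the published GloVe files: one token per line followed by its space-separated vector components. The function name parse_glove_text and the in-memory sample are hypothetical and only serve the illustration; this is not how datamaestro_text itself loads the data.

```python
import io

import numpy as np


def parse_glove_text(handle, dim):
    """Parse GloVe-style text lines into a word list and a (n, dim) matrix.

    Hypothetical helper: assumes each line is "<token> <v1> ... <v_dim>".
    """
    words, rows = [], []
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        # The last `dim` fields are the vector; everything before is the token.
        # Joining the remainder keeps tokens that themselves contain spaces.
        words.append(" ".join(parts[:-dim]))
        rows.append([float(x) for x in parts[-dim:]])
    return words, np.array(rows, dtype=np.float32)


# Tiny in-memory sample standing in for a real embeddings file
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
words, vectors = parse_glove_text(sample, dim=3)
```

Passing the dimension explicitly avoids guessing where the token ends, which matters for files whose tokens may contain spaces.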
Example usage:
from datamaestro import prepare_dataset
# Load 100-dimensional GloVe trained on Wikipedia + Gigaword
glove = prepare_dataset("edu.stanford.glove.6b.100")
# Load embeddings into memory
words, vectors = glove.load()
# Create lookup dictionary
word_to_idx = {w: i for i, w in enumerate(words)}
# Get embedding for a word
idx = word_to_idx.get("example")
if idx is not None:
    embedding = vectors[idx]
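Once the words and vectors are in memory, a common next step is to look up the nearest neighbours of a word by cosine similarity. The following is a minimal sketch, assuming that vectors is a 2-D NumPy array with one row per entry of words. The helper nearest_neighbors and the toy data are hypothetical and stand in for the real GloVe matrix so that the snippet runs on its own.

```python
import numpy as np


def nearest_neighbors(query, words, vectors, k=3):
    """Return the k words most similar to `query` by cosine similarity."""
    word_to_idx = {w: i for i, w in enumerate(words)}
    idx = word_to_idx.get(query)
    if idx is None:
        return []  # out-of-vocabulary query
    # Normalise rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    scores = unit @ unit[idx]
    scores[idx] = -np.inf  # exclude the query word itself
    order = np.argsort(-scores)[:k]
    return [(words[i], float(scores[i])) for i in order]


# Toy data (hypothetical values, not real GloVe vectors)
toy_words = ["king", "queen", "apple", "pear"]
toy_vectors = np.array(
    [
        [1.0, 0.9, 0.0],
        [0.9, 1.0, 0.0],
        [0.0, 0.1, 1.0],
        [0.1, 0.0, 1.0],
    ]
)
neighbors = nearest_neighbors("king", toy_words, toy_vectors, k=1)
```

With the real embeddings, the same function can be called with the words and vectors returned by glove.load(). For large vocabularies it may be worth normalising the matrix once and reusing it across queries.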