Word Embeddings
This module provides data types for pre-trained word embeddings.
Word embeddings are dense vector representations of words, useful for NLP tasks, semantic similarity, and as input features for neural models.
Base Class
- XPM Config datamaestro_text.data.embeddings.WordEmbeddings(*, id)
Bases: Base
Generic word embeddings
- id: str
The unique (sub-)dataset ID
Abstract base class for word embeddings. Provides the load() method, which returns a tuple (words, vectors) where:
- words is a list of vocabulary words
- vectors is a numpy matrix where vectors[i] is the embedding for words[i]
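The (words, vectors) pair described above is enough to run a simple nearest-neighbour lookup. The following is a minimal sketch, not part of datamaestro_text: `cosine` and `most_similar` are hypothetical helpers, and the toy data stands in for what load() would return. Plain lists are used so the example runs without extra dependencies; the same functions should also work on the rows of a numpy matrix.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(words, vectors, query, k=3):
    """Return up to k (word, similarity) pairs closest to `query`, best first.

    Hypothetical helper: returns an empty list if `query` is out of vocabulary.
    """
    index = {word: i for i, word in enumerate(words)}
    if query not in index:
        return []
    q = vectors[index[query]]
    scored = [
        (word, cosine(q, vectors[i]))
        for i, word in enumerate(words)
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-in for the (words, vectors) pair; the values are made up.
toy_words = ["cat", "dog", "car"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
neighbours = most_similar(toy_words, toy_vectors, "cat", k=2)
```

This lookup is linear in the vocabulary size per query; for large vocabularies a vectorised or indexed search would likely be preferable.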
File-Based Embeddings
- XPM Config datamaestro_text.data.embeddings.WordEmbeddingsText(*, id, path, encoding)
Bases: WordEmbeddings, File
Word embeddings stored as text, each entry being a word followed by its values
- id: str
The unique (sub-)dataset ID
- path: path
The path of the file
- encoding: str = utf-8
Word embeddings stored in a text file with format: word value1 value2 ... valueN
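To make the file format concrete, here is a minimal parsing sketch. It is not the library's implementation: `parse_embeddings` is a hypothetical function, and it assumes one entry per line, whitespace-separated fields, and no header line.

```python
import io


def parse_embeddings(lines):
    """Parse 'word value1 value2 ... valueN' lines into (words, vectors).

    Hypothetical helper: lines with fewer than two fields are skipped.
    """
    words, vectors = [], []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or malformed line
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return words, vectors


# In-memory sample standing in for an embeddings file; values are made up.
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n\n")
parsed_words, parsed_vectors = parse_embeddings(sample)
```

With a real file, the same function could be fed an object opened with the dataset's path and encoding, for example `open(path, encoding="utf-8")`.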
Example usage:
from datamaestro import prepare_dataset
# Load GloVe embeddings (50-dimensional)
glove = prepare_dataset("edu.stanford.glove.6b.50")
# Load into memory
words, vectors = glove.load()
# Create a word-to-index mapping
word_to_idx = {word: idx for idx, word in enumerate(words)}
# Get embedding for a word
if "computer" in word_to_idx:
    embedding = vectors[word_to_idx["computer"]]
    print(f"Embedding shape: {embedding.shape}")
# Available GloVe variants:
# - edu.stanford.glove.6b.50 (50d, trained on 6B tokens)
# - edu.stanford.glove.6b.100 (100d)
# - edu.stanford.glove.6b.200 (200d)
# - edu.stanford.glove.6b.300 (300d)
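
Building on the word-to-index mapping from the example above, a common way to turn embeddings into input features is to average the vectors of the known tokens in a text. The sketch below is an illustration under assumptions: `average_embedding` is a hypothetical helper, the toy data replaces the loaded GloVe vectors, and out-of-vocabulary tokens are simply skipped.

```python
def average_embedding(tokens, word_to_idx, vectors):
    """Mean of the embeddings of the in-vocabulary tokens.

    Hypothetical helper: returns None when no token is in the vocabulary.
    """
    rows = [vectors[word_to_idx[t]] for t in tokens if t in word_to_idx]
    if not rows:
        return None
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


# Toy stand-in for the loaded embeddings; the values are made up.
demo_words = ["good", "movie"]
demo_vectors = [[1.0, 3.0], [3.0, 5.0]]
demo_index = {word: idx for idx, word in enumerate(demo_words)}

# "zzz" is out of vocabulary and is ignored.
sentence_vector = average_embedding(["good", "movie", "zzz"], demo_index, demo_vectors)
```

Averaging discards word order, so it is best treated as a simple baseline feature.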