Word Embeddings
===============

This module provides data types for pre-trained word embeddings. Word embeddings
are dense vector representations of words, useful for NLP tasks, semantic
similarity, and as input features for neural models.

Base Class
----------

.. autoxpmconfig:: datamaestro_text.data.embeddings.WordEmbeddings

Abstract base class for word embeddings. Provides the ``load()`` method that
returns a tuple of ``(words, vectors)`` where:

- ``words`` is a list of vocabulary words
- ``vectors`` is a numpy matrix where ``vectors[i]`` is the embedding for ``words[i]``

File-Based Embeddings
---------------------

.. autoxpmconfig:: datamaestro_text.data.embeddings.WordEmbeddingsText

Word embeddings stored in a text file with format: ``word value1 value2 ... valueN``

Example usage:

.. code-block:: python

    from datamaestro import prepare_dataset

    # Load GloVe embeddings (50-dimensional)
    glove = prepare_dataset("edu.stanford.glove.6b.50")

    # Load into memory
    words, vectors = glove.load()

    # Create a word-to-index mapping
    word_to_idx = {word: idx for idx, word in enumerate(words)}

    # Get embedding for a word
    if "computer" in word_to_idx:
        embedding = vectors[word_to_idx["computer"]]
        print(f"Embedding shape: {embedding.shape}")

    # Available GloVe variants:
    # - edu.stanford.glove.6b.50  (50d, trained on 6B tokens)
    # - edu.stanford.glove.6b.100 (100d)
    # - edu.stanford.glove.6b.200 (200d)
    # - edu.stanford.glove.6b.300 (300d)
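
The text format and the ``(words, vectors)`` pair described above can be
illustrated without downloading a dataset. The following is a minimal sketch,
not the library's implementation: ``parse_text_embeddings`` and
``most_similar`` are hypothetical helper names introduced here for
illustration, and the three-word sample is made-up toy data. It assumes one
whitespace-separated entry per line and uses cosine similarity as one common
way to compare embeddings.

.. code-block:: python

    import io

    import numpy as np


    def parse_text_embeddings(lines):
        """Parse lines of the form 'word v1 v2 ... vN' into (words, vectors).

        Hypothetical helper, not part of datamaestro_text.
        """
        words, rows = [], []
        for line in lines:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank lines and lines without any values
            words.append(parts[0])
            rows.append([float(x) for x in parts[1:]])
        return words, np.asarray(rows, dtype=np.float32)


    def most_similar(query, words, vectors, k=3):
        """Return up to k (word, score) pairs ranked by cosine similarity."""
        word_to_idx = {w: i for i, w in enumerate(words)}
        # Normalize rows to unit length; clip avoids division by zero
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        unit = vectors / np.clip(norms, 1e-12, None)
        scores = unit @ unit[word_to_idx[query]]
        order = np.argsort(-scores)
        ranked = [(words[i], float(scores[i])) for i in order if words[i] != query]
        return ranked[:k]


    # Toy data in the "word value1 value2 ... valueN" layout (invented values)
    sample = io.StringIO(
        "king 0.9 0.1 0.0\n"
        "queen 0.8 0.2 0.0\n"
        "apple 0.0 0.1 0.9\n"
    )
    words, vectors = parse_text_embeddings(sample)
    neighbors = most_similar("king", words, vectors, k=2)
    print(neighbors)

The same ``most_similar`` sketch should work on the ``words`` and ``vectors``
returned by ``load()``, provided ``vectors`` is a two-dimensional numpy array
with one row per word, as described in the base class section.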