NLP
This module provides data types for Natural Language Processing datasets, particularly those involving linguistic annotations.
CoNLL-U Format
The CoNLL-U format is a standard plain-text format for annotated linguistic data. It is used by the Universal Dependencies project and in other NLP tasks such as tagging and dependency parsing.
- XPM Config datamaestro_text.data.tagging.CoNLL_U(*, id, path)
Bases: File
- id: str
The unique (sub-)dataset ID
- path: path
The path of the file
CoNLL-U files contain token-level annotations including:
- Word forms and lemmas
- Universal POS tags and language-specific POS tags
- Morphological features
- Dependency relations (head and deprel)
- Miscellaneous annotations
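The annotation layers listed above correspond to the ten tab-separated columns of a CoNLL-U token line. The following is a minimal sketch in plain Python, independent of whatever reader datamaestro itself provides; the `CONLLU_FIELDS` constant and the `parse_token_line` helper are hypothetical names introduced only for illustration.

```python
# Column names in file order, following the CoNLL-U specification.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_token_line(line: str) -> dict:
    """Split one tab-separated token line into a field-name -> value dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(
            f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}"
        )
    return dict(zip(CONLLU_FIELDS, values))


# Hypothetical usage on a single token line.
token = parse_token_line("2\tdog\tdog\tNOUN\tNN\t_\t3\tnsubj\t_\t_")
print(token["form"], token["upos"], token["head"], token["deprel"])
```

All values are kept as strings here, because the `id` column can also hold ranges (multiword tokens) or decimals (empty nodes), and an underscore marks an unspecified value.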
Example in CoNLL-U format (the ten fields of each token line are tab-separated in an actual file):
# sent_id = 1
# text = The dog runs.
1 The the DET DT _ 2 det _ _
2 dog dog NOUN NN _ 3 nsubj _ _
3 runs run VERB VBZ _ 0 root _ _
4 . . PUNCT . _ 3 punct _ _
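A sentence block like the one above can be read with a few lines of plain Python. This is only a sketch under stated assumptions: it does not use the datamaestro API, the `read_sentences` helper and `SAMPLE` string are hypothetical, and the sample rebuilds the example with tab separators as a real CoNLL-U file would have.

```python
def read_sentences(text: str):
    """Yield (metadata, tokens) pairs from CoNLL-U text.

    Sentences are separated by blank lines; lines starting with '#'
    are comments, read here as 'key = value' metadata when possible.
    """
    metadata, tokens = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if tokens:
                yield metadata, tokens
            metadata, tokens = {}, []
        elif line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            if sep:
                metadata[key.strip()] = value.strip()
        else:
            tokens.append(line.split("\t"))
    if tokens:  # last sentence when the text has no trailing blank line
        yield metadata, tokens


# The example sentence, with token fields joined by tabs.
rows = [
    "1 The the DET DT _ 2 det _ _",
    "2 dog dog NOUN NN _ 3 nsubj _ _",
    "3 runs run VERB VBZ _ 0 root _ _",
    "4 . . PUNCT . _ 3 punct _ _",
]
SAMPLE = "\n".join(
    ["# sent_id = 1", "# text = The dog runs."]
    + ["\t".join(row.split()) for row in rows]
)

for metadata, tokens in read_sentences(SAMPLE):
    # Map token IDs to word forms; head "0" denotes the sentence root.
    forms = {tok[0]: tok[1] for tok in tokens}
    forms["0"] = "ROOT"
    for tok in tokens:
        print(f"{tok[1]} --{tok[7]}--> {forms[tok[6]]}")
```

The loop prints one dependency arc per token, using column 7 (head) and column 8 (deprel) of each line.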