Eesthetic is a collection of Estonian verbal and nominal paradigms, in phonemic and orthographic notation. They are suited for both computational and manual analysis.
The data is stored in CSV files, and the metadata follows the Frictionless standards. The dataset conforms to the Paralex standard.
The Estonian Paradigms in Phonemic Notation dataset is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Please cite as:
- Sacha Beniamine, Mari Aigro, Matthew Baerman, Jules Bouton & Maria Copot (2024). Eesthetic: A Paralex Lexicon of Estonian Paradigms. In Proceedings of The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING). To appear.
- Sacha Beniamine, Mari Aigro, Matthew Baerman, Jules Bouton & Maria Copot. Estonian Paradigms in Phonemic Notation [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8383523
The data can be downloaded from Zenodo or from the GitLab repository, and visualized online.
We thank Indrek Hein and Ylle Viks for providing us access to Ekilex data and plenty of support.
References
This dataset is derived from Ekilex data. See:
- Ekilex API. https://github.com/tripledev/ekilex/wiki/Ekilex-API
- Sõnaveeb. Eesti Keele Instituudi keeleportaal. https://sonaveeb.ee
- Tavast, Arvi, et al. "Towards the superdictionary: Layers, tools and unidirectional meaning relations." EURALEX XIX (2021).
- National Estonian Corpus 2021 (https://doi.org/10.15155/3-00-0000-0000-0000-08E60L)
How this lexicon was prepared
We selected the 5000 most frequent verbs and the 5000 most frequent nouns in the Estonian National Corpus files, excluding the web sections. For each corresponding lexeme, we then queried the Ekilex API (see the git repository and Sõnaveeb) to obtain inflected forms in orthographic notation. We used custom epitran rules to convert these into phonemic notation. We performed extensive manual verification and corrected some forms by hand. The input to the epitran rules is the annotated orthographic forms from Ekilex, where:
- ` the grave mark (backtick) signals a syllable in quantity 3, and is placed before the first vowel of the syllable, e.g.: k`indel, kont`ert, allergia, esse`ist
- (`) indicates that the syllable can be either Q2 or Q3. In this case, the transcription uses Q3.
- ´ an acute accent marks stressed syllables when they are not predictable.
- (´) an acute accent in round brackets indicates that the syllable can be either stressed or unstressed. In this case, the transcription includes the stress.
- [ a square bracket marks the border between the word stem and the changing ending of the word form. Note that this is a non-standard marker for Paralex.
- + a plus marks word boundaries: liit+word
- ₊ a subscript plus marks boundaries in loan words: tele₊skoop, de₊odor`an't, plei₊b`oi
- ' a straight apostrophe indicates palatalization (after a consonant): kul'u, not'su, k`un'st
As the annotation was sometimes incomplete, we added more using the vabamorf tool. Vabamorf uses different diacritics, which we converted to the Ekilex notation. Any other diacritics present in Ekilex, such as the "~" separating overabundant forms or the "*" marking impossible forms, were removed.
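The annotation conventions above lend themselves to simple string processing. The following is a minimal sketch, not the code used to build the dataset: the function names are hypothetical, and it assumes that the acute accent is encoded as the spacing character U+00B4 and the loan-word boundary as the subscript plus U+208A.

```python
import re

# Matches every marker described above. The bracketed variants "(`)" and
# "(´)" come first so that their parentheses are removed with the mark.
# Assumption: acute accent is U+00B4, loan-word boundary is U+208A.
_MARKERS = re.compile(r"\(`\)|\(\u00b4\)|[`\u00b4\[+\u208a']")


def strip_annotations(form: str) -> str:
    """Return the plain orthographic form, with all markers removed."""
    return _MARKERS.sub("", form)


def has_q3(form: str) -> bool:
    """True if some syllable is marked as quantity 3 (grave mark)."""
    return "`" in form


def palatalized_consonants(form: str) -> list:
    """Return the letters directly followed by a straight apostrophe."""
    return re.findall(r"(\w)'", form)


if __name__ == "__main__":
    for example in ["k`un'st", "tele\u208askoop", "liit+word", "not'su"]:
        print(example, "->", strip_annotations(example))
```

Keeping the markers in a single pattern means that a new marker only has to be added in one place; a real pipeline would more likely translate each marker into a phonemic feature rather than discard it.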
How to re-generate the data
Dependencies: epitran, paralex
Getting national corpus files: https://metashare.ut.ee/repository/browse/estonian-national-corpus-2021-vert/f176ccc0d05511eca6e4fa163e9d454794df2849e11048bb9fa104f1fec2d03f/
Finding frequencies:
python3 format_lexicon.py --count_freq
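To give an idea of what a frequency count over the corpus involves, here is a rough sketch. It is not the content of format_lexicon.py. It assumes a vertical format with one token per line and tab-separated columns; the column order (form, part of speech, lemma), the tag values "V" and "S", and the function name `top_lemmas` are all assumptions, and the sample tokens are placeholders rather than corpus data.

```python
from collections import Counter


def top_lemmas(lines, pos, n, pos_col=1, lemma_col=2):
    """Return the n most frequent lemmas with the given part of speech.

    Column indices are assumptions about the vertical format; adjust
    them to the actual corpus files.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip empty lines and structural markup such as <s> or </doc>.
        if not line or line.startswith("<"):
            continue
        fields = line.split("\t")
        if len(fields) <= max(pos_col, lemma_col):
            continue
        if fields[pos_col] == pos:
            counts[fields[lemma_col]] += 1
    return [lemma for lemma, _ in counts.most_common(n)]


# Placeholder tokens, for illustration only.
sample = [
    "<s>",
    "a1\tV\tlemA",
    "a2\tV\tlemA",
    "b1\tV\tlemB",
    "c1\tS\tlemC",
    "</s>",
]

if __name__ == "__main__":
    print(top_lemmas(sample, "V", 2))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large corpus file does not need to be loaded into memory.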
To download data from Ekilex, you need to create a file mysecrets.py containing a variable with the Ekilex API key:
api_key = "<API_KEY>"
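As an illustration of how such a key might be used, the sketch below builds an authenticated request without sending it. The header name `ekilex-api-key`, the base URL and the `/word/search/` path are assumptions to be checked against the Ekilex API wiki, and `build_request` is a hypothetical helper, not a function from this repository.

```python
import urllib.parse
import urllib.request


def build_request(word, api_key, base_url="https://ekilex.ee/api"):
    """Build (but do not send) a word search request.

    The endpoint path and the header name are assumptions; see the
    Ekilex API documentation for the actual interface.
    """
    # Percent-encode the word so that letters such as "õ" are URL-safe.
    url = f"{base_url}/word/search/{urllib.parse.quote(word)}"
    return urllib.request.Request(url, headers={"ekilex-api-key": api_key})


# Typical use, once mysecrets.py exists next to the script:
#     from mysecrets import api_key
#     request = build_request("kindel", api_key)
#     with urllib.request.urlopen(request) as response:
#         payload = response.read()
```

Keeping the key in a separate module makes it easy to exclude from version control.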
Fetching all the paradigm information from Ekilex (this takes a very long time!):
python3 format_lexicon.py --query_ekilex
Extracting the lexicon from json -> csv:
python3 format_lexicon.py --extract
Install vabamorf from https://github.com/Filosoft/vabamorf/ (requires g++ and gmake):
git clone https://github.com/Filosoft/vabamorf.git
cd vabamorf/apps/cmdline/project/unix
make -s etsyn
The directory vabamorf/apps/cmdline/project/unix will then contain the etsyn executable; add this directory to your PATH.
Then get the dictionary file:
wget https://github.com/Filosoft/vabamorf/raw/master/dct/binary/et.dct -P vabamorf/
Then, produce vabamorf output files in raw/steps/*_with_vabamorf.csv:
python3 format_lexicon.py --vabamorf
Evaluating the transcription on dev forms:
python3 format_lexicon.py --eval_dev
Phonological transcription:
python3 format_lexicon.py --transcribe
All at once (this can take a really long time!):
python3 format_lexicon.py --count_freq --query_ekilex --extract --vabamorf --transcribe