Data set name: Estonian Paradigms in Phonemic Notation
Citation (if available): Sacha Beniamine, Mari Aigro, Matthew Baerman & Maria Copot. Estonian Paradigms in Phonemic Notation [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8383523
Data set developer(s): Sacha Beniamine
Data sheet author(s): Sacha Beniamine
Others who contributed to this document:
Motivation
For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.
This dataset was created in order to study the inflectional morphology of Estonian. It is intended for use in NLP and linguistic investigation.
Who created the dataset (for example, which team, research group) and on behalf of which entity (for example, company, institution, organization)?
- Sacha Beniamine, University of Surrey: dataset creation and management
- Mari Aigro, University of Tartu: Language expertise, phonemic transcription
- Matthew Baerman, University of Surrey: dataset design, language expertise, phonemic transcription
- Maria Copot, The Ohio State University: corrections to the distinctive features in the sounds file
Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.
This work was partially funded by a British Academy International Newton Fellowship (NIF23\100218) and by a Leverhulme Early Career Fellowship (ECF-2022-286), both awarded to Sacha Beniamine.
Any other comments?
We thank Indrek Hein and Ülle Viks for providing us access to Ekilex data and plenty of support.
Composition
Paralex datasets document paradigms of inflected forms.
Are forms given as orthographic, phonetic, and/or phonemic sequences?
Forms are given as both phonemic and orthographic sequences.
How many instances are there in total?
- Number of inflected forms: 452 885 distinct inflected forms.
- Number of lexemes: 10 551 lexemes in total (5 475 nouns, 5 076 verbs).
- Maximal paradigm size in cells: 79 cells in total (28 nominal, 51 verbal).
Language varieties
Languages differ from each other in structural ways that can interact with NLP algorithms. Within a language, regional or social dialects can also show great variation (Chambers and Trudgill, 1998). The language and language variety should be described with a language tag from BCP-47 identifying the language variety (e.g., en-US or yue-Hant-HK), and a prose description of the language variety, glossing the BCP-47 tag and also providing further information (e.g., "English as spoken in Palo Alto, California", or "Cantonese written with traditional characters by speakers in Hong Kong who are bilingual in Mandarin").
- BCP-47 language tag: et
- Language variety description: Estonian
Does the data pertain to specific dialects, geographical locations, genres, etc.?
No.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (for example, geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (for example, to cover a more diverse range of instances, because instances were withheld or unavailable).
We selected the 5000 most frequent verbs and 5000 most frequent nouns in the Estonian National Corpus files, excluding the web sections. When querying the Ekilex API, this resulted in slightly over 5000 lexemes in each category, due to homonymy. Any Ekilex entries that were marked as neither nouns nor verbs, or for which no inflected forms were present, were excluded.
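A minimal sketch of this selection and filtering step, assuming the corpus is read as (lemma, pos) pairs; all function and key names here are hypothetical, and the authoritative code is `format_lexicon.py`:

```python
from collections import Counter

def select_lemmas(corpus_lemmas, pos, n=5000):
    """Return the n most frequent lemmas of the given part of speech.

    corpus_lemmas: iterable of (lemma, pos) pairs read from the corpus files.
    """
    counts = Counter(lemma for lemma, p in corpus_lemmas if p == pos)
    return [lemma for lemma, _ in counts.most_common(n)]

def keep_ekilex_entry(entry):
    """Exclude entries that are neither nouns nor verbs, or have no forms."""
    # `entry` stands for one JSON record returned by the Ekilex API;
    # the exact key names used here are assumptions.
    has_forms = any(p.get("forms") for p in entry.get("paradigms", []))
    return entry.get("pos") in {"noun", "verb"} and has_forms
```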
Is any information missing from individual instances?
If so, please provide a description, explaining why this information is missing (for example, because it was unavailable). This does not include intentionally removed information, but might include, for example, redacted text.
- What was the reasoning behind the selection of defective forms?
Defective forms are included. We marked as defective any form for which the paradigm details `.forms[].value` was either empty or `"-"`, except in the SgAdt (a parallel form of the illative singular, which is not always present).
Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.
-
How is certainty expressed (doubtful or unattested forms)?
- The epistemic tag `questionable` was added for forms which were marked as questionable in the Ekilex paradigm details `.forms[].questionable`.
- The defectiveness tag `starred` was added to forms which had initial stars in the annotated orthographic form, meaning that they can be formed but usually do not occur.
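Taken together with the defectiveness rule above, these annotations amount to a decision rule of roughly the following shape (a hedged sketch: only the `.forms[].value` and `.forms[].questionable` paths come from this datasheet, everything else is an assumption; the authoritative rules are in `format_lexicon.py`):

```python
def annotate_form(form, cell):
    """Return the annotation tags for one Ekilex form record."""
    tags = []
    value = form.get("value", "")
    # Defective: .forms[].value empty or "-", except in the SgAdt cell.
    if value in ("", "-") and cell != "SgAdt":
        tags.append("defective")
    # Epistemic tag: mirrors .forms[].questionable.
    if form.get("questionable"):
        tags.append("questionable")
    # Starred: initial star in the annotated orthographic form,
    # i.e. a form that can be built but usually does not occur.
    if value.startswith("*"):
        tags.append("starred")
    return tags
```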
Is the dataset self-contained, or does it link to or otherwise rely on external resources (for example, websites, tweets, other datasets)?
If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (that is, including the external resources as they existed at the time the dataset was created); c) are there any restrictions (for example, licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.
Some identifiers are references to Ekilex data:
- `WordId` in lexemes is a reference to the word details `.word.wordId`
- `paradigmId` in lexemes is a reference to the paradigm details `.forms[].paradigmId`
- `form_id` in paradigms is a reference to `.forms[].id`
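As an illustration, a consumer with Ekilex credentials could resolve a `form_id` back to its source record along these lines (illustrative only; `paradigm_details` stands for the paradigm details JSON returned by the API):

```python
def find_ekilex_form(paradigm_details, form_id):
    """Look up the Ekilex record behind a Paralex form, via .forms[].id."""
    for form in paradigm_details["forms"]:
        if form["id"] == form_id:
            return form
    return None
```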
If linking to vocabularies from other databases (such as databases of features, cells, sounds, languages, or online dictionaries), were there any complex decisions in the matching of entries from this dataset to those of the vocabularies (e.g. an inexact language code)?
NA.
Does the dataset contain data that might be considered confidential (for example, data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)? If so, please provide a description.
No.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.
No.
Any other comments?
No.
Collection process
What is the provenance of each table (lexemes, cells, forms, frequencies, sounds, features), as well as of segmentation marks, if any? Is any information derived from other datasets?
Was any information (forms, lexemes, cells, frequencies) extracted from a corpus or a dictionary, elicited, extracted from field notes, digitized from grammars, or generated? What are the sources?
- Forms: orthographic forms were queried from Ekilex; phonemic forms were transduced from the annotated orthographic forms using custom Epitran rules.
- Lexemes: chosen by taking the 5000 most frequent lemmas in the Estonian National Corpus files, excluding the web sections.
- Cells and features-values tables were elaborated by hand.
- Sounds: first generated using Hayes' universal spreadsheet, then corrected by hand by Mari Aigro and Maria Copot.
See:
Hayes, Bruce (2012). Spreadsheet with segments and their feature values. Distributed as part of course material for Linguistics 120A: Phonology I at UCLA. URL: http://www.linguistics.ucla.edu/people/hayes/120a/index.htm.
How were paradigms separated between lexemes (e.g. in the case of homonyms or variants)? What theoretical or practical choices were made?
We counted one lexeme for each unique combination of Word ID and Paradigm ID in Ekilex. This means that words which can take two distinct inflection classes constitute two lexeme entries. Parallel forms, however, are part of the same paradigm.
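Schematically, the splitting convention can be expressed as a grouping key (a sketch of the convention, not the released code; the field names follow the identifier columns described under Composition):

```python
from itertools import groupby

def split_lexemes(form_rows):
    """Group Ekilex form records into lexemes by (WordId, paradigmId)."""
    key = lambda row: (row["WordId"], row["paradigmId"])
    for (word_id, paradigm_id), forms in groupby(sorted(form_rows, key=key), key):
        # A word attested in two inflection classes has two paradigmIds,
        # hence yields two lexemes; parallel forms share one paradigmId.
        yield (word_id, paradigm_id), list(forms)
```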
How was the paradigm structure (set and labels of paradigm cells) decided ? What theoretical or practical choices were made ?
We adhered to the structure provided in Ekilex, except that we marked the SgAdt as a parallel form of the illative singular (ill.sg).
What is the expertise of the contributors with the documented language?
Are they areal experts, language experts, or native speakers?
Mari Aigro is an areal expert and a native speaker. The other contributors are morphologists.
How was the data collected (for example, manual human curation, generation by software programs, software APIs, etc)? How were these mechanisms or procedures validated?
We rely on Ekilex data for the orthographic forms, combining their annotations with those from Vabamorf.
Transcriptions are generated through Epitran rules, which can be found in the folder `.epitran/`. To validate these, we annotated 144 forms presenting various particularities (`/evaluation/dev_forms.csv`), found in the literature or added manually during rule elaboration, and validated our generation procedure against them. Numerous forms were also checked manually in the process.
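A sketch of this validation loop, assuming the custom rules from `.epitran/` are installed under an Epitran code such as `est-Latn` and that the dev file has one orthographic and one phonemic column (both the code name and the column names are assumptions):

```python
import csv

import epitran

# Assumed code under which the custom Estonian rules are installed.
epi = epitran.Epitran("est-Latn")

with open("evaluation/dev_forms.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Compare the rule-generated transcription against the gold annotation.
mismatches = [
    (row["orth_form"], row["phon_form"], epi.transliterate(row["orth_form"]))
    for row in rows
    if epi.transliterate(row["orth_form"]) != row["phon_form"]
]
print(f"{len(mismatches)} / {len(rows)} dev forms disagree with the rules")
```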
If the dataset is a sample from a larger set, what was the sampling strategy (for example, deterministic, probabilistic with specific sampling probabilities)?
> Curation rationale: Which lemmas, forms, cells were included and what were the goals in selecting entries, both in the original collection and in any further sub-selection? This can be especially important in datasets too large to thoroughly inspect by hand. An explicit statement of the curation rationale can help dataset users make inferences about what other kinds of texts systems trained with them could conceivably generalize to.
See above.
Who was involved in the data collection process (for example, students, crowdworkers, contractors) and how were they compensated (for example, how much were crowdworkers paid)?
NA; all collaborators are the researchers cited above.
Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (for example, recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.
NA.
Were any ethical review processes conducted (for example, by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.
NA.
Any other comments?
No.
Preprocessing/cleaning/labeling
How were frequencies measured? If this was done directly from a corpus, is the software for frequency extraction available?
Frequencies were measured from the National Corpus (excluding web sections) using the `count_frequencies` function in `format_lexicon.py`. This is a very simple script which counts occurrences of each nominal and verbal lemma in the corpus.
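A minimal sketch of what such a counting function might look like, assuming the corpus is read as (lemma, pos) pairs (an assumption; the real `count_frequencies` lives in `format_lexicon.py`):

```python
from collections import Counter

def count_frequencies(corpus_tokens, lemmas):
    """Count corpus occurrences of each selected nominal or verbal lemma.

    corpus_tokens: iterable of (lemma, pos) pairs from the corpus;
    lemmas: set of (lemma, pos) pairs retained in the dataset.
    """
    return Counter(tok for tok in corpus_tokens if tok in lemmas)
```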
How were the inflected forms obtained? If generated, what was the generation process? Is the software for generation available?
See above: by querying the Ekilex API.
How were the phonological or phonemic transcriptions obtained? If generated, what was the generation process? Is the software for generation available?
See above: using custom Epitran phonemic rules, with a few post-processing steps to make manual corrections or annotations that require lookup across paradigm forms or paradigm-external information. All are accounted for in the script `format_lexicon.py`.
If relevant, how were the forms segmented?
The initial segmentation was retained. For phonemic forms, we changed all segmentation marks to `+`.
Was any preprocessing/cleaning/labeling of the data done (for example, discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values, cleaning of labels, mapping between vocabularies, etc)? If so, please provide a description. If not, you may skip the remaining questions in this section. This includes estimation of frequencies.
- The cell vocabulary mapping is given in `estonian_cells.csv`. The SgAdt was remapped to ill.sg.
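Applying such a mapping is a plain lookup; in this sketch the column names of `estonian_cells.csv` are assumptions:

```python
import csv

with open("estonian_cells.csv", newline="", encoding="utf-8") as f:
    # Assumed columns: the Ekilex cell label and its Paralex counterpart.
    cell_map = {row["ekilex_cell"]: row["paralex_cell"] for row in csv.DictReader(f)}

def map_cell(ekilex_cell):
    """E.g. map_cell("SgAdt") == "ill.sg" under the remapping described above."""
    return cell_map.get(ekilex_cell, ekilex_cell)
```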
Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (for example, to support unanticipated future uses)? If so, please provide a link or other access point to the "raw" data.
The raw data (JSON files queried from the API) is not made public, but it can be downloaded again with the proper credentials using the `format_lexicon.py` script.
Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.
Yes, see `format_lexicon.py`.
Any other comments?
No.
Uses
Has the dataset been used for any published work already? If so, please provide a description.
No.
Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.
No.
What (other) tasks could the dataset be used for?
Any NLP task concerned with inflection and based on phonemic form; linguistic investigations into inflection, whether quantitative or qualitative.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a dataset consumer might need to know to avoid uses that could result in unfair treatment of individuals or groups (for example, stereotyping, quality of service issues) or other risks or harms (for example, legal risks, financial harms)? If so, please provide a description. Is there anything a dataset consumer could do to mitigate these risks or harms?
Automated phonemic transcriptions may be less accurate for borrowings. We did our best to mitigate this issue when the information was available.
Are there tasks for which the dataset should not be used? If so, please provide a description.
NA
Any other comments?
No.
Distribution
Will the dataset be distributed to third parties outside of the entity (for example, company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.
No.
How will the dataset be distributed (for example, tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?
DOI: https://doi.org/10.5281/zenodo.8383523. The DOI points to a Zenodo deposit. The dataset is also available as a repository on GitLab: https://gitlab.com/sbeniamine/estonianparadigms/-/releases
When will the dataset be distributed?
It is already distributed.
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/ or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.
License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.
No.
Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.
No.
Any other comments?
No.
Maintenance
Who will be supporting/hosting/maintaining the dataset?
Sacha Beniamine
How can the owner/curator/manager of the dataset be contacted (for example, email address)?
Please raise an issue on the GitLab repository, or email at s.
Is there an erratum? If so, please provide a link or other access point.
No.
Will the dataset be updated (for example, to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to dataset consumers (for example, mailing list, GitHub)?
Yes, whenever relevant. Updates will be pushed to GitLab and lead to new versions, which are themselves pushed to Zenodo.
If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (for example, were the individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.
No.
Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.
Yes, thanks to Zenodo and GitLab.
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description.
We welcome merge requests on gitlab.
Any other comments?
No.