Data set name: European Portuguese Verbal Paradigms in Phonemic Notation
Citation (if available): Perdigão, Fernando, Beniamine, Sacha, Luís, Ana R., & Bonami, Olivier. (2021). European Portuguese Verbal Paradigms in Phonemic Notation [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5121543
Data set developer(s): Sacha Beniamine
Data sheet author(s): Sacha Beniamine
Others who contributed to this document: None
Motivation
For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.
This dataset was created to study patterns of interpredictability within European Portuguese verbal paradigms. It is intended for use in NLP and linguistic investigation.
Who created the dataset (for example, which team, research group) and on behalf of which entity (for example, company, institution, organization)?
The creators of the dataset are Fernando Perdigão (Universidade de Coimbra, DEEC-FCTUC); Sacha Beniamine (University of Surrey, SMG); Ana R. Luís (Universidade de Coimbra, CELGA-ILTEC); and Olivier Bonami (Université de Paris, LLF, CNRS).
Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.
This work was partially funded by a British Academy International Newton Fellowship (NIF23\100218) and by the grant "EFL - Empirical Foundations of Linguistics: data, methods, models" (10-LABX-0083).
Composition
Paralex datasets document paradigms of inflected forms.
Are forms given as orthographic, phonetic, and/or phonemic sequences?
Forms are given as phonemic sequences.
How many instances are there in total?
- Number of inflected forms: 324,415 distinct inflected forms
- Number of lexemes: 4992 lexemes
- Maximal paradigm size in cells: 65 cells
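The counts above can be recomputed directly from the Paralex forms table. A minimal sketch with pandas, assuming the forms table is a CSV with `lexeme`, `cell`, and `phon_form` columns (column names follow the Paralex standard and may differ in a given release); the toy rows are hypothetical illustrations, not actual dataset content:

```python
import pandas as pd

def paradigm_stats(forms: pd.DataFrame) -> dict:
    """Count distinct inflected forms, lexemes, and the maximal paradigm size in cells."""
    return {
        "inflected_forms": len(forms),
        "lexemes": forms["lexeme"].nunique(),
        "max_paradigm_size": int(forms.groupby("lexeme")["cell"].nunique().max()),
    }

# Hypothetical rows for illustration only:
toy = pd.DataFrame({
    "lexeme": ["falar", "falar", "comer"],
    "cell": ["prs.ind.1sg", "prs.ind.2sg", "prs.ind.1sg"],
    "phon_form": ["form_a", "form_b", "form_c"],
})
print(paradigm_stats(toy))
```

Run against the real forms table (e.g. `pd.read_csv(...)` on the released CSV), the same function should yield the figures listed above.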
Language varieties
Languages differ from each other in structural ways that can interact with NLP algorithms. Within a language, regional or social dialects can also show great variation (Chambers and Trudgill, 1998). The language and language variety should be described with a language tag from BCP-47 identifying the language variety (e.g., en-US or yue-Hant-HK), and a prose description of the language variety, glossing the BCP-47 tag and also providing further information (e.g., "English as spoken in Palo Alto, California", or "Cantonese written with traditional characters by speakers in Hong Kong who are bilingual in Mandarin").
- BCP-47 language tag: pt
- Language variety description: European Portuguese
Does the data pertain to specific dialects, geographical locations, genre, etc.?
The generated pronunciation is that of European Portuguese. The transcription used here corresponds to a possible realisation in a formal context with a relatively slow speech rate, which minimises the instances of fusion or coarticulation of vocalic sounds.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (for example, geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (for example, to cover a more diverse range of instances, because instances were withheld or unavailable).
We selected the 5000 most frequent verb lexemes in the CETEMPúblico corpus (Santos and Rocha, 2001), relying on frequency lists provided by the AC/DC project. Eight of these lexemes were subsequently excluded during manual correction.
An important feature of this lexicon is that only one form is given per paradigm cell of each lexeme: neither overabundance nor defectivity is taken into account. Notably, the lexicon records a single form of the past participle (the most "regular" according to traditional descriptions) and does not take into account variation in gender and number.
Is any information missing from individual instances?
If so, please provide a description, explaining why this information is missing (for example, because it was unavailable). This does not include intentionally removed information, but might include, for example, redacted text.
No data is missing, and we did not annotate defective forms.
Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.
Not to the best of our knowledge; we welcome feedback should any mistakes be found.
Is the dataset self-contained, or does it link to or otherwise rely on external resources (for example, websites, tweets, other datasets)?
If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (that is, including the external resources as they existed at the time the dataset was created); c) are there any restrictions (for example, licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.
The dataset is complete and self-contained.
If linking to vocabularies from other databases (such as databases of features, cells, sounds, languages, or online dictionaries), were there any complex decisions in the matching of entries from this dataset to those of the vocabularies (e.g., an inexact language code)?
N/A
Does the dataset contain data that might be considered confidential (for example, data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)? If so, please provide a description.
No.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.
No.
Any other comments?
No.
Collection process
What is the provenance of each table (lexemes, cells, forms, frequencies, sounds, features), as well as of segmentation marks, if any? Is any information derived from other datasets?
Was any information (forms, lexemes, cells, frequencies) extracted from a corpus or a dictionary, elicited, extracted from field notes, digitized from grammars, or generated? What are the sources?
- Forms: Full paradigms in phonemic transcription for these verbs were generated using pronunciation dictionaries and text-to-speech tools developed at the University of Coimbra (Candeias, Veiga, and Perdigão, 2015; Marquiafável et al., 2014). We made further adjustments by hand.
- Lexemes were chosen by taking the 5000 most frequent verb lexemes in the CETEMPúblico corpus (Santos and Rocha, 2001), relying on frequency lists provided by the AC/DC project. Eight lexemes were then excluded during manual annotation.
- The phonemes table was adapted from that provided in: Beniamine, Sacha (2018). Classifications flexionnelles: Étude quantitative des structures de paradigmes. PhD thesis, Université Sorbonne Paris Cité - Université Paris Diderot (Paris 7), supervised by Olivier Bonami.
How were paradigms separated between lexemes (e.g., in the case of homonyms or variants)? What theoretical or practical choices were made?
Lexemes display no homonymy or variants; we provide one full paradigm per lexeme.
How was the paradigm structure (set and labels of paradigm cells) decided ? What theoretical or practical choices were made ?
Our selection of cells reflects the general form of a verbal paradigm of European Portuguese:
Finite forms:
- 5 tenses within the indicative mood (present, imperfect, simple past, pluperfect, future)
- 3 tenses within the subjunctive mood (past, present, future)
- a conditional
- an imperative with only second person forms
Non-finite forms:
- infinitive (inf)
- personal infinitive (per.inf)
- past participle
- gerund
We left aside:
- the 3sg, 1pl and 3pl forms that are sometimes listed in the imperative but are really present subjunctive forms that can be used in contexts similar to that of the imperative
- variation in gender and number of the past participle
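This cell inventory accounts for the maximal paradigm size of 65 cells reported in the Composition section. A quick check of the arithmetic, assuming six person-number combinations per fully inflected finite tense and for the personal infinitive:

```python
# Finite forms: each tense/mood combination has 6 person-number cells,
# except the imperative, which has only 2sg and 2pl.
indicative = 5 * 6      # present, imperfect, simple past, pluperfect, future
subjunctive = 3 * 6     # past, present, future
conditional = 1 * 6
imperative = 2          # second person forms only

# Non-finite forms: the personal infinitive inflects for person and number.
personal_infinitive = 6
uninflected = 3         # infinitive, past participle, gerund

total = (indicative + subjunctive + conditional + imperative
         + personal_infinitive + uninflected)
print(total)  # 65
```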
What is the expertise of the contributors with the documented language ?
Are they areal experts, language experts, or native speakers?
Fernando Perdigão and Ana R. Luís are both native speakers and areal experts.
How was the data collected (for example, manual human curation, generation by software programs, software APIs, etc)? How were these mechanisms or procedures validated?
Fernando Perdigão provided the paradigms for the selected verbs. These were obtained using pronunciation dictionaries and text-to-speech tools developed at the University of Coimbra, and corrected by hand. For more information, see:
- Candeias, Sara, Arlindo Veiga, and Fernando Perdigão (2015). Pronunciação de Verbos Portugueses - Guia Prático. LIDEL.
- Marquiafável, Vanessa, et al. (Oct. 2014). “Rule-Based Algorithms for Automatic Pronunciation of Portuguese Verbal Inflections”. In: International Conference on Computational Processing of Portuguese - PROPOR. Vol. 8775, pp. 36–47. DOI: 10.1007/978-3-319-09761-9_4.
- Santos, Diana and Paulo Rocha (2001). “Evaluating CETEMPúblico, a free resource for Portuguese”. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pp. 450–457. DOI: 10.3115/1073012.1073070
If the dataset is a sample from a larger set, what was the sampling strategy (for example, deterministic, probabilistic with specific sampling probabilities)?
Curation rationale: which lemmas, forms, and cells were included, and what were the goals in selecting entries, both in the original collection and in any further sub-selection? This can be especially important in datasets too large to thoroughly inspect by hand. An explicit statement of the curation rationale can help dataset users make inferences about what other kinds of texts systems trained with them could conceivably generalize to.
See above in the Composition section for lexeme selection.
Who was involved in the data collection process (for example, students, crowdworkers, contractors) and how were they compensated (for example, how much were crowdworkers paid)?
NA
Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (for example, recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.
NA
Were any ethical review processes conducted (for example, by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.
NA
Any other comments?
No.
Preprocessing/cleaning/labeling
How were the inflected forms obtained ? If generated, what was the generation process ? Is the software for generation available ?
See above regarding data generation and references on the software used.
If relevant, how were the forms segmented ?
NA
Was any preprocessing/cleaning/labeling of the data done (for example, discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values, cleaning of labels, mapping between vocabularies, etc)? If so, please provide a description. If not, you may skip the remaining questions in this section. This includes estimation of frequencies.
Phonological annotation was adjusted, with all changes recorded in the script src/format_lexicon.py: stress was added to all forms, nasal diphthong annotation was adjusted to mark nasality only on the first vowel, and a few forms were corrected where manual inspection uncovered mistakes.
Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (for example, to support unanticipated future uses)? If so, please provide a link or other access point to the "raw" data.
The "raw" data can be found in src/PronunciarListaVebos2021_v2.txt.
Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.
See src/format_lexicon.py. Metadata was generated by gen-metadata.py.
Any other comments?
On phonemic notation: the transcriptions used are surface-oriented and standardised. They are similar to those of Mateus and d’Andrade (2000), with four differences: (i) semi-vowels are not distinguished from high vowels; (ii) the non-low central vowel is transcribed [ə] rather than [ɨ]; (iii) stress is marked with the IPA stress symbol placed immediately before the stressed vowel; (iv) diphthongs are written using the glides [j] and [w] after the initial vowel.
Uses
Has the dataset been used for any published work already? If so, please provide a description.
This work was used in:
- Sacha Beniamine, Olivier Bonami, and Ana R. Luís (2021), The fine implicative structure of European Portuguese conjugation, Isogloss. Open Journal of Romance Linguistics. DOI: https://doi.org/10.5565/rev/isogloss.109
Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.
No.
What (other) tasks could the dataset be used for?
Any NLP task concerned with inflection and based on phonemic form; linguistic investigations into inflection, whether quantitative or qualitative.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a dataset consumer might need to know to avoid uses that could result in unfair treatment of individuals or groups (for example, stereotyping, quality of service issues) or other risks or harms (for example, legal risks, financial harms)? If so, please provide a description. Is there anything a dataset consumer could do to mitigate these risks or harms?
NA
Are there tasks for which the dataset should not be used? If so, please provide a description.
NA
Any other comments?
No.
Distribution
Will the dataset be distributed to third parties outside of the entity (for example, company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.
No.
How will the dataset be distributed (for example, tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?
The dataset has a DOI, https://doi.org/10.5281/zenodo.5121543, which points to a Zenodo deposit. The dataset is also available as a repository on GitLab: https://gitlab.com/sbeniamine/europeanportugueseverbs
When will the dataset be distributed?
It is already distributed.
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/ or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.
License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.
No.
Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.
No.
Any other comments?
No.
Maintenance
Who will be supporting/hosting/maintaining the dataset?
Sacha Beniamine
How can the owner/curator/manager of the dataset be contacted (for example, email address)?
Please raise an issue on the GitLab repository, or email at s.
Is there an erratum? If so, please provide a link or other access point.
No.
Will the dataset be updated (for example, to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to dataset consumers (for example, mailing list, GitHub)?
Yes, whenever relevant. Updates will be pushed to GitLab and released as new versions, which are themselves pushed to Zenodo.
If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (for example, were the individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.
No.
Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.
Yes: all versions remain available through Zenodo and GitLab.
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description.
We welcome merge requests on GitLab.
Any other comments?
No.