Home
This is a collection of European Portuguese verbal paradigms, in phonemic notation.
They are suited for both computational and manual analysis. The paradigms table lists all available lexemes and provides full paradigms for each. The segments table lists all phonemes used in the transcription and describes them in terms of distinctive features.
The European Portuguese Verbs lexicon is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license.
Please cite as:
- Perdigão, Fernando, Beniamine, Sacha, Luís, Ana R., & Bonami, Olivier. (2021). European Portuguese Verbal Paradigms in Phonemic Notation [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5121543
Version 1.0.1 of this lexicon was prepared for the publication:
- Sacha Beniamine, Olivier Bonami, and Ana R. Luís (2021), The fine implicative structure of European Portuguese conjugation, Isogloss. Open Journal of Romance Linguistics. DOI: https://doi.org/10.5565/rev/isogloss.109
The data can be downloaded from Zenodo or from the GitLab repository.
How this lexicon was prepared
We selected the 5000 most frequent verb lexemes in the CETEMPúblico corpus (Santos and Rocha, 2001), relying on frequency lists provided by the AC/DC project. Full paradigms in phonemic transcription for these verbs were generated using pronunciation dictionaries and text-to-speech tools developed at the University of Coimbra (Candeias, Veiga, and Perdigão, 2015; Marquiafável et al., 2014). We made further adjustments by hand. In the process, a handful of verbs had to be excluded.
References
- See https://lsi.co.it.pt/verbos/ for full paradigms with orthographic forms, transcriptions, and audio synthesis. Transcriptions may vary slightly from the conventions used here.
- Candeias, Sara, Arlindo Veiga, and Fernando Perdigão (2015). Pronunciação de Verbos Portugueses - Guia Prático. LIDEL.
- Marquiafável, Vanessa et al. (Oct. 2014). “Rule-Based Algorithms for Automatic Pronunciation of Portuguese Verbal Inflections”. In: International Conf. on Computational Processing of Portuguese - PROPOR. Vol. 8775, pp. 36–47. DOI: 10.1007/978-3-319-09761-9_4.
- Santos, Diana and Paulo Rocha (2001). “Evaluating CETEMPúblico, a free resource for Portuguese”. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pp. 450–457. DOI: 10.3115/1073012.1073070.
Scripts
The only dependency is pandas (version 1.2.4 was used). The Python version used was 3.8.
To re-generate the lexicon, navigate to the data repository and run:
python3 src/format_lexicon.py
To run tests:
python3 -m unittest tests/test_lexicon.py
For Paralex validation, after installing paralex:
paralex validate *.package.json
Format
The data is stored in CSV files, and the metadata follows the Frictionless standards. The dataset conforms to the Paralex standard.
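As a rough illustration of how a long-format paradigms table might be consumed, the sketch below groups rows into one dictionary per lexeme using only the Python standard library. The column names (`form_id`, `lexeme`, `cell`, `phon_form`) are assumed from the Paralex conventions and should be checked against the actual files. The sample rows are made-up placeholders, not entries from this lexicon.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for the paradigms CSV file. The column names are an assumption
# based on Paralex conventions; every value below is a placeholder, not real data.
SAMPLE_CSV = """form_id,lexeme,cell,phon_form
1,lexA,cell1,f o r m a
2,lexA,cell2,f o r m b
3,lexB,cell1,f o r m c
"""


def read_paradigms(handle):
    """Group long-format rows into {lexeme: {cell: [phon_form, ...]}}.

    Forms are kept in a list so that a cell with several variants
    does not silently lose any of them.
    """
    paradigms = defaultdict(dict)
    for row in csv.DictReader(handle):
        cells = paradigms[row["lexeme"]]
        cells.setdefault(row["cell"], []).append(row["phon_form"])
    return dict(paradigms)


paradigms = read_paradigms(io.StringIO(SAMPLE_CSV))
for lexeme, cells in sorted(paradigms.items()):
    print(lexeme, cells)
```

To read an actual file instead of the toy sample, pass an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is the location of the downloaded paradigms CSV.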