LE-PAROLE
Portuguese Lexicon Documentation
Institutions:
Centro de Linguística da Universidade de Lisboa (CLUL)
Instituto de Engenharia e Sistemas de Computadores (INESC) - Natural Language Group
1. General Design Information
This section presents a brief overview of the information present in the Portuguese PAROLE lexicon.
The lexical material was selected on the basis of the frequency of textual items in a subcorpus of 5 thousand words. This subcorpus was extracted from the CRPC (Corpus de Referência do Português Contemporâneo, a CLUL project) and has the following composition:
- newspaper: 44%
- periodical: 16%
- book: 38%
- law: 2%
The corpus was lemmatized with INESC's morphological analyzer Palavroso (Medeiros, 1995), and a frequency list of lemmas was extracted. All lemmas with a frequency above the established threshold were considered and manually checked. About 18 thousand lemmas were selected from the subcorpus, but the percentages per grammatical category did not conform to the percentages requested by the project. Some adjustments were therefore made, including the addition of two thousand nouns extracted from the Palavroso analyzer dictionary.
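The selection step described above can be pictured with a minimal sketch. It assumes the corpus has already been lemmatized into a flat list of lemmas; the function name `select_lemmas`, the sample data and the threshold value are illustrative assumptions, since this document does not state the threshold that was actually used.

```python
from collections import Counter

def select_lemmas(lemmas, threshold):
    """Return (lemma, frequency) pairs whose frequency is strictly above
    `threshold`, sorted by descending frequency and then alphabetically."""
    counts = Counter(lemmas)
    selected = [(lemma, n) for lemma, n in counts.items() if n > threshold]
    return sorted(selected, key=lambda item: (-item[1], item[0]))

# Toy lemmatized corpus (hypothetical data, not taken from the CRPC).
sample = ["casa", "ser", "ser", "casa", "ser", "abafar", "menino", "casa"]

# Hypothetical threshold: keep lemmas occurring more than once.
candidates = select_lemmas(sample, threshold=1)
print(candidates)
```

In the project itself the resulting candidate list was then checked manually, a step that this sketch does not model.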
Some thematic words from Fundamental Portuguese (Bacelar et al., 1987) that did not attain the frequency level in the corpus were added.
Afterwards, the frequency of the selected words was checked in a corpus of 15 million running words.
All grammatical words and numerals were added separately, based on the Palavroso dictionary.
Only toponyms and anthroponyms were considered as proper nouns.
For the selection of abbreviations and acronyms, the Palavroso list was complemented by a selection of occurrences from the Parole corpus and by suggestions from the lexicographers.
2. Current Lexicon Contents
This section presents factual data about the Portuguese lexicon in the format conformant to the Parole DTD.
2.1 Morphological layer
The morphological information (classification and paradigms) was converted from the Palavroso lexicon. Representing the information embedded in Palavroso in the Genelex model was not a simple operation. As demonstrated in the MLAP Parole report on the compatibility of our lexicon with the Genelex model (Medeiros et al., Oct. 1995), Palavroso is a rule-based analyzer, whereas Genelex uses a paradigmatic approach. In Palavroso the rule component is primary and the lexicon is seen as additional information. In the Genelex model, lemmas have links to inflectional models; in Palavroso, rules and lemmas have no explicit links between them. Thus, to obtain CIFs, the analysis rules had to be inverted in some way. The solution adopted was to develop a generation module for the Palavroso system that generates the forms and converts them into SGML format at the same time. Some special procedures were necessary, namely to handle verb forms accompanied by clitics.
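The contrast between the two approaches can be illustrated with a toy sketch. It is not Palavroso's rule format nor the Genelex encoding: the suffix list, the dictionary `LEXICON` and the functions `analyze` and `generate_forms` are all hypothetical, and only the regular present indicative of first-conjugation verbs is covered.

```python
# Present-indicative endings of regular -ar verbs (illustrative subset).
AR_PRESENT_SUFFIXES = ["o", "as", "a", "amos", "ais", "am"]

# Paradigmatic view: each lemma carries an explicit link to its inflection model.
LEXICON = {"abafar": {"remove": "ar", "suffixes": AR_PRESENT_SUFFIXES}}

def analyze(form, known_lemmas):
    """Rule-based view: strip a known ending, restore '-ar', and keep the
    candidate only if the lexicon happens to contain it."""
    candidates = []
    for suffix in AR_PRESENT_SUFFIXES:
        if form.endswith(suffix):
            lemma = form[: -len(suffix)] + "ar"
            if lemma in known_lemmas and lemma not in candidates:
                candidates.append(lemma)
    return candidates

def generate_forms(lemma):
    """Inverted direction: start from the lemma and emit its inflected forms."""
    model = LEXICON[lemma]
    if not lemma.endswith(model["remove"]):
        raise ValueError(f"{lemma!r} does not fit its inflection model")
    stem = lemma[: -len(model["remove"])]
    return [stem + suffix for suffix in model["suffixes"]]

print(analyze("abafamos", LEXICON.keys()))
print(generate_forms("abafar"))
```

In `analyze` the rules come first and the lexicon only filters their output, while `generate_forms` needs the explicit lemma-to-model link that the conversion had to make available.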
The generator was used to introduce verbs (including special forms when accompanied by clitics), nouns, adjectives, adverbs and numerals into the tool. Grammatical words, symbols, proper nouns, acronyms, abbreviations, foreign words. The tool generates the data into normal files, which are then loaded into AlethGD using the standard fillers. Compounds, agglutinated forms and non-autonomous words were introduced interactively.
The general principles for identifying Morphological Units coincide with the GENELEX principles, as specified in the "Rapport sur la Couche Morphologique". One criterion differed in the source data converted from Palavroso: there, a difference in noun gender always gives rise to two entries. Therefore, all nouns in the Parole lexicon were manually revised so that only one MU is considered when the gender variation reflects only the sex of the referent (menino/menina). With this adaptation, the whole set of criteria conforms to the GENELEX choices.
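The effect of this revision can be sketched as follows. The sketch assumes that the manual review produced a list of masculine/feminine pairs judged to differ only by the sex of the referent; the function `merge_gender_pairs` and the data layout are hypothetical, not the format used in the project.

```python
def merge_gender_pairs(entries, reviewed_pairs):
    """Merge separate gender entries into one unit per reviewed pair.

    entries: list of (form, gender) tuples, one per source entry.
    reviewed_pairs: list of (masculine, feminine) forms approved manually.
    """
    feminine_to_masculine = {fem: masc for masc, fem in reviewed_pairs}
    units = {}
    for form, gender in entries:
        # A reviewed feminine form is filed under its masculine counterpart.
        head = feminine_to_masculine.get(form, form)
        unit = units.setdefault(head, {"lemma": head, "forms": {}})
        unit["forms"][gender] = form
    return list(units.values())

# Two source entries for menino/menina, one for an unrelated noun.
entries = [("menino", "m"), ("menina", "f"), ("casa", "f")]
units = merge_gender_pairs(entries, [("menino", "menina")])
print(units)
```

Nouns that are not in the reviewed list keep their own entry, so the merge only applies where a lexicographer has confirmed the pairing.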
General counting array:
| Nb of Morphological Units                | 20435 |
| Nb of Simple Morphological Units         | 19782 |
| Nb of Graphical Morphological Units      | 19936 |
| Number of simple words inflexion modes   | 378   |
| Number of compound words                 | 653   |
| Number of compound words inflexion modes | 180   |
Grammatical Categories array
| Category     | Subcategory   | Ums (simple) | Umc (compound) | Total | Example     |
| WITHOUTC     | WITHOUTSC     | 195          | 0              | 195   | cima        |
| NOUN         | COMMON        | 11305        | 329            | 11634 | zaire       |
| NOUN         | PROPER        | 429          | 103            | 532   | Barbuda     |
| VERB         | AUX           | 2            | 0              | 2     | haver       |
| VERB         | MAIN          | 2998         | 0              | 2998  | abafar      |
| ADJECTIVE    | WITHOUTSC     | 3218         | 22             | 3240  | nascido     |
| PRONOUN      | DEMONSTRATIVE | 20           | 0              | 20    | tal         |
| PRONOUN      | POSSESSIVE    | 10           | 0              | 10    | vosso       |
| PRONOUN      | INTERROGATIVE | 6            | 0              | 6     | quem        |
| PRONOUN      | EXCLAMATIVE   | 6            | 0              | 6     | quem        |
| PRONOUN      | RECIPROCAL    | 3            | 0              | 3     | vos         |
| PRONOUN      | REFLEXIVE     | 10           | 0              | 10    | consigo     |
| PRONOUN      | PERSONAL      | 28           | 0              | 28    | vocês       |
| PRONOUN      | RELATIVE      | 9            | 0              | 9     | quem        |
| PRONOUN      | INDEFINITE    | 40           | 5              | 45    | várias      |
| ADVERB       | WITHOUTSC     | 529          | 68             | 597   | abaixo      |
| ADPOSITION   | WITHOUTSC     | 27           | 40             | 67    | sobre       |
| CONJUNCTION  | COORDINATIVE  | 21           | 3              | 24    | todavia     |
| CONJUNCTION  | SUBORDINATIVE | 18           | 27             | 45    | que         |
| NUMERAL      | ORDINAL       | 33           | 0              | 33    | bilionésimo |
| NUMERAL      | CARDINAL      | 43           | 0              | 43    | um          |
| DETERMINER   | DEMONSTRATIVE | 17           | 0              | 17    | mesma       |
| DETERMINER   | POSSESSIVE    | 10           | 0              | 10    | vosso       |
| DETERMINER   | INTERROGATIVE | 6            | 0              | 6     | quem        |
| DETERMINER   | EXCLAMATIVE   | 6            | 0              | 6     | quem        |
| DETERMINER   | INDEFINITE    | 40           | 5              | 45    | várias      |
| ARTICLE      | INDEFINITE    | 2            | 0              | 2     | uma         |
| ARTICLE      | DEFINITE      | 2            | 0              | 2     | o           |
| INTERJECTION | WITHOUTSC     | 35           | 1              | 36    | ah          |
| RESIDUAL     | ACRONYM       | 145          | 0              | 145   | ACP         |
| RESIDUAL     | FOREIGN       | 360          | 50             | 410   | libris      |
| RESIDUAL     | SYMBOL        | 146          | 0              | 146   | °           |
| RESIDUAL     | ABBREVIATION  | 62           | 0              | 62    | A.          |
| UNIQUE       | MEDIOPASSIVE  | 1            | 0              | 1     | se          |
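As a consistency check, the per-category figures in the Grammatical Categories array can be summed and compared with the general counting array. The sketch below copies the Ums and Umc columns from that array and the three totals from the general counts; the variable names are illustrative.

```python
# (category, subcategory, ums, umc) copied from the Grammatical Categories array.
ROWS = [
    ("WITHOUTC", "WITHOUTSC", 195, 0),
    ("NOUN", "COMMON", 11305, 329),
    ("NOUN", "PROPER", 429, 103),
    ("VERB", "AUX", 2, 0),
    ("VERB", "MAIN", 2998, 0),
    ("ADJECTIVE", "WITHOUTSC", 3218, 22),
    ("PRONOUN", "DEMONSTRATIVE", 20, 0),
    ("PRONOUN", "POSSESSIVE", 10, 0),
    ("PRONOUN", "INTERROGATIVE", 6, 0),
    ("PRONOUN", "EXCLAMATIVE", 6, 0),
    ("PRONOUN", "RECIPROCAL", 3, 0),
    ("PRONOUN", "REFLEXIVE", 10, 0),
    ("PRONOUN", "PERSONAL", 28, 0),
    ("PRONOUN", "RELATIVE", 9, 0),
    ("PRONOUN", "INDEFINITE", 40, 5),
    ("ADVERB", "WITHOUTSC", 529, 68),
    ("ADPOSITION", "WITHOUTSC", 27, 40),
    ("CONJUNCTION", "COORDINATIVE", 21, 3),
    ("CONJUNCTION", "SUBORDINATIVE", 18, 27),
    ("NUMERAL", "ORDINAL", 33, 0),
    ("NUMERAL", "CARDINAL", 43, 0),
    ("DETERMINER", "DEMONSTRATIVE", 17, 0),
    ("DETERMINER", "POSSESSIVE", 10, 0),
    ("DETERMINER", "INTERROGATIVE", 6, 0),
    ("DETERMINER", "EXCLAMATIVE", 6, 0),
    ("DETERMINER", "INDEFINITE", 40, 5),
    ("ARTICLE", "INDEFINITE", 2, 0),
    ("ARTICLE", "DEFINITE", 2, 0),
    ("INTERJECTION", "WITHOUTSC", 35, 1),
    ("RESIDUAL", "ACRONYM", 145, 0),
    ("RESIDUAL", "FOREIGN", 360, 50),
    ("RESIDUAL", "SYMBOL", 146, 0),
    ("RESIDUAL", "ABBREVIATION", 62, 0),
    ("UNIQUE", "MEDIOPASSIVE", 1, 0),
]

# Totals copied from the general counting array.
EXPECTED_SIMPLE = 19782    # simple morphological units
EXPECTED_COMPOUND = 653    # compound words
EXPECTED_TOTAL = 20435     # all morphological units

total_ums = sum(ums for _, _, ums, _ in ROWS)
total_umc = sum(umc for _, _, _, umc in ROWS)

print(total_ums, total_umc, total_ums + total_umc)
```

With the figures as printed, the three sums come out to 19782, 653 and 20435, which matches the general counting array.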
2.2 Syntactic layer
As in the section on morphology, we give here a general overview of the method used to identify Syntactic Units and Syntactic Descriptions.
The splitting criteria to encode several syntactic units were:
- different syntactic functions for the same number of positions;
- different thematic roles for the same syntactic function;
- optionality;
- possibility of establishing relations in a frameset.
The following choices were made regarding optionality and alternative realizations:
- those described in the model, i.e., whenever a position is optional it is not described, unless the optionality depends on the context, in which case a different syntactic unit is considered and the context specified;
- if a position can be filled by different categories with the same syntactic function, they appear in the same syntactic unit under the relevant position;
- when the syntactic function is different, a new syntactic unit is considered.
The frameset construction strategy is used whenever:
- an argument can occur in different positions and with different syntactic functions (e.g. subject of an intransitive construction versus object of a transitive one), and these occurrences therefore need to be related;
- a verb allows a raising construction, taking as object or subject an argument that can also appear as an argument of an embedded clause (where it really belongs), so that these occurrences also need to be related.
List of syntactic functions used for Portuguese:
- subject
- object
- indirect object
- oblique
- adjunct
- predicative complement of the subject
- predicative complement of the object
- v_modifier
- n_of_comp
- n_prep_comp
- n_apposition
- n_adjunct
- n_clausal_comp
- n_determinative
- n_attributive
- n_modifier
Strategy to describe control information:
- We distinguish control constructions from raising ones. The latter are described through a frameset construction, which has a "raising" feature, and through the assignment of the same coref feature to the different positions and functions in which the argument can appear. Control constructions also have a feature (subject_control, object_control...); controller and controllee share the same coref feature. The controllee occupies a position without an overt realization (PRO).
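The way a shared coref feature ties positions together can be sketched with a small data structure. This is a simplified illustration rather than the Genelex or Parole encoding: the class names, the `clause` and `overt` fields, the construction names and the coref labels "i" and "j" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Position:
    function: str                  # e.g. "subject", "object"
    clause: str = "main"           # "main" or "embedded" (illustrative field)
    coref: Optional[str] = None    # shared index linking related positions
    overt: bool = True             # False for a controllee realized as PRO

@dataclass
class Construction:
    name: str
    positions: List[Position]
    features: Set[str] = field(default_factory=set)

@dataclass
class FrameSet:
    constructions: List[Construction]
    features: Set[str] = field(default_factory=set)

def linked_positions(constructions, coref):
    """List (construction, clause, function) for every position carrying `coref`."""
    return [
        (c.name, p.clause, p.function)
        for c in constructions
        for p in c.positions
        if p.coref == coref
    ]

# Raising: a frameset relates two constructions in which the same argument
# (coref "i") surfaces in different places.
raising = FrameSet(
    constructions=[
        Construction("raised", [Position("subject", coref="i")]),
        Construction("non_raised", [Position("subject", clause="embedded", coref="i")]),
    ],
    features={"raising"},
)

# Control: a single construction whose controller and covert controllee
# share coref "j".
control = Construction(
    "control",
    [
        Position("subject", coref="j"),
        Position("subject", clause="embedded", coref="j", overt=False),
    ],
    features={"subject_control"},
)

print(linked_positions(raising.constructions, "i"))
print(linked_positions([control], "j"))
```

In this layout the raising case needs the frameset to hold two related constructions, whereas the control case is expressed inside one construction through the non-overt position.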
| Number of syntactic Units | 23218 |
| Number of constructions   | 475   |
3. Bibliography
General:
[Medeiros 1995] José Carlos Medeiros. Processamento Morfológico e Correcção Ortográfica do Português. Tese de Mestrado, Instituto Superior Técnico - Universidade Técnica de Lisboa, Lisboa, 1995.
[Nascimento, Rivenc, Cruz 1987] Maria Fernanda B. do Nascimento, Paul Rivenc, M. Luísa Segura da Cruz. Português Fundamental - Métodos e Documentos, Tomo 2. INIC/CLUL, Lisboa, 1987.
[Genelex 1994] Project Eureka Genelex, Consortium Genelex. Rapport sur la Couche Morphologique. Version 3.3, 2 November 1994.
PAROLE documents:
[Medeiros, Santos, Wittmann 1995] José Carlos Medeiros, Diana Santos, Luzia Wittmann. MLAP PAROLE-WP4-Task 2.2a. "Portuguese Lexicon: Inflectional Morphology. On the compatibility of the Portuguese lexicon Palavroso and the Genelex model at the morphological level". October 1995.
Appendix - state of validation by the AlethGD tool
General counting array
| Nb of Morphological Units                | 20435 |
| Nb of Simple Morphological Units         | 19782 |
| Nb of Graphical Morphological Units      | 19936 |
| Number of simple words inflexion modes   | 378   |
| Number of compound words                 | 653   |
| Number of compound words inflexion modes | 180   |
| Number of syntactic Units                | 23218 |
| Number of constructions                  | 475   |
Full counting array
Morphological layer
| dico_Um_S      | 19782 |
| dico_Um_Aff    | 0     |
| dico_Um_Agg    | 90    |
| dico_Um_C      | 653   |
| dico_Umg       | 19936 |
| dico_Ump       | 0     |
| dico_Radg      | 863   |
| dico_Radp      | 0     |
| dico_Mfg       | 378   |
| dico_Mfp       | 0     |
| dico_Mfc       | 180   |
| dico_CombTM    | 124   |
| dico_Comb_Comb | 182   |
| dico_Etymon    | 0     |
| dico_Trait_M   | 179   |
| dico_Trait_F   | 0     |
| dico_CatGram   | 16    |
| dico_SsCatGram | 67    |
| dico_Statut    | 5     |
| dico_Separg    | 11    |
| dico_Separp    | 7     |
| dico_TypeBref  | 4     |
Syntactic layer
| dico_Usyn               | 23218 |
| dico_Description        | 604   |
| dico_Self               | 54    |
| dico_IntervConst        | 54    |
| dico_ComportAppele      | 0     |
| dico_Optionnalite       | 0     |
| dico_Construction       | 475   |
| dico_Position_C         | 332   |
| dico_Position_S         | 0     |
| dico_Insertion          | 0     |
| dico_Syntagme_T         | 55    |
| dico_Syntagme_NT_C      | 169   |
| dico_Syntagme_NT_S      | 0     |
| dico_MdC                | 0     |
| dico_TransfDescription  | 0     |
| dico_ModifConstruction  | 0     |
| dico_ModifPosition      | 0     |
| dico_TransfSyntagme     | 0     |
| dico_ModifSyntagme_T    | 0     |
| dico_ModifSyntagme_NT   | 0     |
| dico_ModifIntervConst   | 0     |
| dico_Trait_Lex          | 28    |
| dico_Trait_RefLex       | 0     |
| dico_Trait_S_Fermes     | 277   |
| dico_Trait_Aux          | 0     |
| dico_Trait_Bin          | 0     |
| dico_Trait_Libre        | 41    |
| dico_Fonction           | 37    |
| dico_RoleTh             | 11    |
| dico_TransfUsyn         | 0     |
| dico_EtiquetteSynt_T    | 16    |
| dico_EtiquetteSynt_NT   | 10    |
| dico_Frame_Set          | 132   |
Grammatical Categories array
| Category     | Subcategory   | Ums (simple) | Umc (compound) | Total | Example     |
| WITHOUTC     | WITHOUTSC     | 195          | 0              | 195   | cima        |
| NOUN         | COMMON        | 11305        | 329            | 11634 | zaire       |
| NOUN         | PROPER        | 429          | 103            | 532   | Barbuda     |
| VERB         | AUX           | 2            | 0              | 2     | haver       |
| VERB         | MAIN          | 2998         | 0              | 2998  | abafar      |
| ADJECTIVE    | WITHOUTSC     | 3218         | 22             | 3240  | nascido     |
| PRONOUN      | DEMONSTRATIVE | 20           | 0              | 20    | tal         |
| PRONOUN      | POSSESSIVE    | 10           | 0              | 10    | vosso       |
| PRONOUN      | INTERROGATIVE | 6            | 0              | 6     | quem        |
| PRONOUN      | EXCLAMATIVE   | 6            | 0              | 6     | quem        |
| PRONOUN      | RECIPROCAL    | 3            | 0              | 3     | vos         |
| PRONOUN      | REFLEXIVE     | 10           | 0              | 10    | consigo     |
| PRONOUN      | PERSONAL      | 28           | 0              | 28    | vocês       |
| PRONOUN      | RELATIVE      | 9            | 0              | 9     | quem        |
| PRONOUN      | INDEFINITE    | 40           | 5              | 45    | várias      |
| ADVERB       | WITHOUTSC     | 529          | 68             | 597   | abaixo      |
| ADPOSITION   | WITHOUTSC     | 27           | 40             | 67    | sobre       |
| CONJUNCTION  | COORDINATIVE  | 21           | 3              | 24    | todavia     |
| CONJUNCTION  | SUBORDINATIVE | 18           | 27             | 45    | que         |
| NUMERAL      | ORDINAL       | 33           | 0              | 33    | bilionésimo |
| NUMERAL      | CARDINAL      | 43           | 0              | 43    | um          |
| DETERMINER   | DEMONSTRATIVE | 17           | 0              | 17    | mesma       |
| DETERMINER   | POSSESSIVE    | 10           | 0              | 10    | vosso       |
| DETERMINER   | INTERROGATIVE | 6            | 0              | 6     | quem        |
| DETERMINER   | EXCLAMATIVE   | 6            | 0              | 6     | quem        |
| DETERMINER   | INDEFINITE    | 40           | 5              | 45    | várias      |
| ARTICLE      | INDEFINITE    | 2            | 0              | 2     | uma         |
| ARTICLE      | DEFINITE      | 2            | 0              | 2     | o           |
| INTERJECTION | WITHOUTSC     | 35           | 1              | 36    | ah          |
| RESIDUAL     | ACRONYM       | 145          | 0              | 145   | ACP         |
| RESIDUAL     | FOREIGN       | 360          | 50             | 410   | libris      |
| RESIDUAL     | SYMBOL        | 146          | 0              | 146   | °           |
| RESIDUAL     | ABBREVIATION  | 62           | 0              | 62    | A.          |
| UNIQUE       | MEDIOPASSIVE  | 1            | 0              | 1     | se          |