LE-PAROLE

Portuguese Lexicon Documentation

 

Institutions:

Centro de Linguística da Universidade de Lisboa (CLUL)

Instituto de Engenharia e Sistemas de Computadores (INESC) - Natural Language Group

 

 

1. General Design Information

This section presents a brief overview of the information present in the Portuguese PAROLE lexicon.

The lexical material was selected according to the frequency of the textual items in a subcorpus of 5 thousand words. This subcorpus was extracted from the CRPC - Corpus de Referência do Português Contemporâneo (a project of CLUL), with the following composition:

- newspaper: 44%

- periodical: 16%

- book: 38%

- law: 2%

The corpus was lemmatized with INESC's morphological analyzer Palavroso (Medeiros, 1995) and a frequency list of lemmas was extracted. All lemmas with a frequency above the established threshold were considered and manually checked. About 18 thousand lemmas were selected from the subcorpus, but the percentages per grammatical category did not conform to the percentages requested in the project. Some adjustments were therefore made, including the addition of two thousand nouns extracted from the Palavroso analyzer dictionary.
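The selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be an already lemmatized token stream (the Palavroso interface is not shown in this document), and the function name `select_lemmas`, the toy tokens and the threshold value are all hypothetical.

```python
from collections import Counter


def select_lemmas(lemmas, threshold):
    """Return, in sorted order, the lemmas whose frequency is above `threshold`."""
    freq = Counter(lemmas)
    return sorted(lemma for lemma, count in freq.items() if count > threshold)


# Toy lemmatized token stream, invented for illustration only.
tokens = ["casa", "ser", "casa", "abafar", "ser", "ser", "menino"]

# Hypothetical threshold; the value actually used in the project is not stated.
candidates = select_lemmas(tokens, threshold=1)
print(candidates)  # ['casa', 'ser']
```

The candidates produced this way would still go through the manual check and the per-category adjustments mentioned above.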

Some thematic words from Fundamental Portuguese (Bacelar et al., 1987), which did not attain the frequency level in the corpus, were added.

Afterwards, the frequency of the words gathered was checked in a corpus of 15 million running words.

All the grammatical words and numerals were added separately, based on the Palavroso dictionary.

As proper nouns, only toponyms and anthroponyms were considered.

For the selection of abbreviations and acronyms, the Palavroso list was complemented by a selection of occurrences from the Parole corpus and by suggestions from the lexicographers.

 

2. Current Lexicon Contents

This section presents factual data about the Portuguese lexicon in the format conformant to the Parole DTD.

 

2.1 Morphological layer

The morphological information (classification and paradigms) was converted from the Palavroso lexicon. Representing the information embedded in Palavroso in the Genelex model was not a simple operation. As was demonstrated in the MLAP Parole report on the compatibility of our lexicon with the Genelex model (Medeiros et al., October 1995), Palavroso is a rule-based analyzer whereas Genelex uses a paradigmatic approach. This means that in Palavroso the rule component is primary and the lexicon is seen as additional information. Whereas in the Genelex model lemmas have links to inflectional models, in Palavroso rules and lemmas have no explicit links between them. Thus, to obtain CIFs, the analysis rules had to be inverted in some way. The solution adopted was to develop a generation module for the Palavroso system that generates the forms and converts them into SGML format at the same time. Some special procedures were necessary, namely to handle verb forms accompanied by clitics.
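The paradigmatic view, in which each lemma is explicitly linked to an inflectional model that can generate its forms, can be illustrated with a minimal sketch. It is not the Palavroso generation module: the class `Paradigm`, the name `AR_PRESENT` and the dictionary `LEXICON` are hypothetical, and only the present indicative of regular first-conjugation verbs is covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paradigm:
    """Toy inflectional model: strip a suffix from the lemma, then add endings."""
    name: str
    strip: str
    endings: tuple

    def generate(self, lemma):
        if not lemma.endswith(self.strip):
            raise ValueError(f"{lemma!r} does not fit paradigm {self.name!r}")
        stem = lemma[: len(lemma) - len(self.strip)]
        return [stem + ending for ending in self.endings]


# Present indicative endings of regular verbs in -ar.
AR_PRESENT = Paradigm("ar_present", "ar", ("o", "as", "a", "amos", "ais", "am"))

# Explicit lemma -> paradigm link, as a paradigmatic lexicon requires.
LEXICON = {"abafar": AR_PRESENT}

forms = LEXICON["abafar"].generate("abafar")
print(forms)  # ['abafo', 'abafas', 'abafa', 'abafamos', 'abafais', 'abafam']
```

In a rule-based analyzer the same endings would sit in analysis rules that strip them from a surface form, with no per-lemma link; inverting those rules amounts to building tables such as `AR_PRESENT` and attaching them to lemmas.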

The generator was used to introduce verbs (including special forms when accompanied by clitics), nouns, adjectives, adverbs and numerals in the tool. Grammatical words, Symbols, Proper Nouns, Acronyms, Abbreviations, Foreign words. The tool generates the data into normal files, which are then loaded into AlethGD, using the standard fillers. Compounds, agglutinated forms and non-autonomous words were introduced interactively.

The general principles for identifying the Morphological Units coincide with the GENELEX principles, as specified in the "Rapport sur la Couche Morphologique". One criterion differed in the source data to be converted from Palavroso: in Palavroso, gender variation in nouns always gives rise to two entries. Therefore, all nouns of the Parole lexicon were manually revised so that only one MU is considered when the gender variation reflects only the sex of the referent (menino/menina). With this adaptation the whole set of criteria conforms to the GENELEX choices.
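The revision itself was manual, but a shortlist of candidate pairs could be produced along the following lines. This is a hypothetical helper, not part of the project tools: the function name `shortlist_gender_pairs`, the `(lemma, gender)` input format and the restriction to the -o/-a alternation are all assumptions made for illustration.

```python
def shortlist_gender_pairs(nouns):
    """Pair masculine entries in -o with feminine entries in -a on the same stem.

    `nouns` is a list of (lemma, gender) tuples with gender "m" or "f".
    The result is only a list of candidates for manual revision.
    """
    feminine = {lemma for lemma, gender in nouns if gender == "f"}
    pairs = []
    for lemma, gender in nouns:
        if gender == "m" and lemma.endswith("o"):
            counterpart = lemma[:-1] + "a"
            if counterpart in feminine:
                pairs.append((lemma, counterpart))
    return pairs


nouns = [("menino", "m"), ("menina", "f"), ("casa", "f"), ("caso", "m")]
pairs = shortlist_gender_pairs(nouns)
print(pairs)  # [('menino', 'menina'), ('caso', 'casa')]
```

The pair caso/casa shows why a human decision is still needed: the two nouns are unrelated, so they must remain separate MUs, whereas menino/menina differ only in the sex of the referent.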

 

General counting array:

Nb of Morphological Units                   20435
Nb of Simple Morphological Units            19782
Nb of Graphical Morphological Units         19936
Number of simple words inflexion modes        378
Number of compound words                      653
Number of compound words inflexion modes      180

 

Grammatical Categories array

Category      Subcategory       Ums    Umc  Total  Example
WITHOUTC      WITHOUTSC         195      0    195  cima
NOUN          COMMON          11305    329  11634  zaire
NOUN          PROPER            429    103    532  Barbuda
VERB          AUX                 2      0      2  haver
VERB          MAIN             2998      0   2998  abafar
ADJECTIVE     WITHOUTSC        3218     22   3240  nascido
PRONOUN       DEMONSTRATIVE      20      0     20  tal
PRONOUN       POSSESSIVE         10      0     10  vosso
PRONOUN       INTERROGATIVE       6      0      6  quem
PRONOUN       EXCLAMATIVE         6      0      6  quem
PRONOUN       RECIPROCAL          3      0      3  vos
PRONOUN       REFLEXIVE          10      0     10  consigo
PRONOUN       PERSONAL           28      0     28  vocês
PRONOUN       RELATIVE            9      0      9  quem
PRONOUN       INDEFINITE         40      5     45  várias
ADVERB        WITHOUTSC         529     68    597  abaixo
ADPOSITION    WITHOUTSC          27     40     67  sobre
CONJUNCTION   COORDINATIVE       21      3     24  todavia
CONJUNCTION   SUBORDINATIVE      18     27     45  que
NUMERAL       ORDINAL            33      0     33  bilionésimo
NUMERAL       CARDINAL           43      0     43  um
DETERMINER    DEMONSTRATIVE      17      0     17  mesma
DETERMINER    POSSESSIVE         10      0     10  vosso
DETERMINER    INTERROGATIVE       6      0      6  quem
DETERMINER    EXCLAMATIVE         6      0      6  quem
DETERMINER    INDEFINITE         40      5     45  várias
ARTICLE       INDEFINITE          2      0      2  uma
ARTICLE       DEFINITE            2      0      2  o
INTERJECTION  WITHOUTSC          35      1     36  ah
RESIDUAL      ACRONYM           145      0    145  ACP
RESIDUAL      FOREIGN           360     50    410  libris
RESIDUAL      SYMBOL            146      0    146  °
RESIDUAL      ABBREVIATION       62      0     62  A.
UNIQUE        MEDIOPASSIVE        1      0      1  se
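The figures in the two arrays above can be cross-checked with a short script. The sketch assumes that the Ums and Umc columns count simple and compound Morphological Units respectively; the row data and the three reference totals are copied from the arrays, while the names `ROWS`, `GENERAL` and `check` are illustrative.

```python
# (category, subcategory, Ums, Umc, Total), copied from the array above.
ROWS = [
    ("WITHOUTC", "WITHOUTSC", 195, 0, 195),
    ("NOUN", "COMMON", 11305, 329, 11634),
    ("NOUN", "PROPER", 429, 103, 532),
    ("VERB", "AUX", 2, 0, 2),
    ("VERB", "MAIN", 2998, 0, 2998),
    ("ADJECTIVE", "WITHOUTSC", 3218, 22, 3240),
    ("PRONOUN", "DEMONSTRATIVE", 20, 0, 20),
    ("PRONOUN", "POSSESSIVE", 10, 0, 10),
    ("PRONOUN", "INTERROGATIVE", 6, 0, 6),
    ("PRONOUN", "EXCLAMATIVE", 6, 0, 6),
    ("PRONOUN", "RECIPROCAL", 3, 0, 3),
    ("PRONOUN", "REFLEXIVE", 10, 0, 10),
    ("PRONOUN", "PERSONAL", 28, 0, 28),
    ("PRONOUN", "RELATIVE", 9, 0, 9),
    ("PRONOUN", "INDEFINITE", 40, 5, 45),
    ("ADVERB", "WITHOUTSC", 529, 68, 597),
    ("ADPOSITION", "WITHOUTSC", 27, 40, 67),
    ("CONJUNCTION", "COORDINATIVE", 21, 3, 24),
    ("CONJUNCTION", "SUBORDINATIVE", 18, 27, 45),
    ("NUMERAL", "ORDINAL", 33, 0, 33),
    ("NUMERAL", "CARDINAL", 43, 0, 43),
    ("DETERMINER", "DEMONSTRATIVE", 17, 0, 17),
    ("DETERMINER", "POSSESSIVE", 10, 0, 10),
    ("DETERMINER", "INTERROGATIVE", 6, 0, 6),
    ("DETERMINER", "EXCLAMATIVE", 6, 0, 6),
    ("DETERMINER", "INDEFINITE", 40, 5, 45),
    ("ARTICLE", "INDEFINITE", 2, 0, 2),
    ("ARTICLE", "DEFINITE", 2, 0, 2),
    ("INTERJECTION", "WITHOUTSC", 35, 1, 36),
    ("RESIDUAL", "ACRONYM", 145, 0, 145),
    ("RESIDUAL", "FOREIGN", 360, 50, 410),
    ("RESIDUAL", "SYMBOL", 146, 0, 146),
    ("RESIDUAL", "ABBREVIATION", 62, 0, 62),
    ("UNIQUE", "MEDIOPASSIVE", 1, 0, 1),
]

# Totals taken from the general counting array.
GENERAL = {"simple": 19782, "compound": 653, "total": 20435}


def check(rows):
    """Return rows whose Total is not Ums + Umc, and the three column sums."""
    bad_rows = [r for r in rows if r[2] + r[3] != r[4]]
    sums = {
        "simple": sum(r[2] for r in rows),
        "compound": sum(r[3] for r in rows),
        "total": sum(r[4] for r in rows),
    }
    return bad_rows, sums


bad_rows, sums = check(ROWS)
print(bad_rows)         # []
print(sums == GENERAL)  # True
```

Every row is internally consistent, and the column sums reproduce the numbers of simple units, compound words and Morphological Units given in the general counting array.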

 

2.2 Syntactic layer

As in the section on morphology, we give here a general overview of the method used to identify Syntactic Units and Syntactic Descriptions.

The splitting criteria to encode several syntactic units were:

- different syntactic functions for the same number of positions;

- different thematic roles for the same syntactic function;

- optionality;

- possibility of establishing relations in a frameset.

The following choices were made regarding optionality and alternative realizations:

- those described in the model, i.e., whenever a position is optional it is not described, unless the optionality depends on the context, in which case a different syntactic unit is considered and the context is specified;

- if a position can be filled by different categories having the same syntactic function, they appear in the same syntactic unit under the relevant position;

- when the syntactic function is different, a new syntactic unit is considered.
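The last two choices can be sketched as a grouping operation: realizations of a position that share a syntactic function stay in one syntactic unit, and each different function opens a new unit. The function name `split_into_units`, the dictionary layout and the category labels (NP, clause, PP) are assumptions made for illustration and are not the encoding used in the lexicon.

```python
from collections import defaultdict


def split_into_units(realizations):
    """Group (category, function) realizations of one position by function.

    Each distinct syntactic function yields one unit; categories sharing a
    function are listed together under that unit.
    """
    by_function = defaultdict(list)
    for category, function in realizations:
        if category not in by_function[function]:
            by_function[function].append(category)
    return [
        {"function": function, "categories": categories}
        for function, categories in by_function.items()
    ]


# Hypothetical realizations of the second position of some verb.
realizations = [("NP", "object"), ("clause", "object"), ("PP", "oblique")]
units = split_into_units(realizations)
for unit in units:
    print(unit)
```

Here the NP and the clause end up in the same unit because both are objects, while the oblique PP gives rise to a separate unit.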

The frameset construction strategy is used whenever:

- an argument can occur in different positions and with different syntactic functions (e.g. subject of an intransitive construction versus object of a transitive one), and the two occurrences thus need to be related;

- a verb allows for a raising construction, having as its object or subject an argument that can also appear as an argument of an embedded clause (to which it really belongs), and the two occurrences thus also need to be related.

List of syntactic functions used for Portuguese:

- subject

- object

- indirect object

- oblique

- adjunct

- complement predicative of the subject

- complement predicative of the object

- v_modifier

- n_of_comp

- n_prep_comp

- n_apposition

- n_adjunct

- n_clausal_comp

- n_determinative

- n_attributive

- n_modifier

Strategy to describe control information:

- We distinguish control constructions from raising ones. The latter are described through a frameset construction, which has a "raising" feature, and through the assignment of the same Coref feature to the different positions and functions in which the argument can appear. Control constructions also have one feature (subject_control, object_control...); Controller and Controllee share the same Coref feature. The Controllee occupies a position without an overt realization (PRO).
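The control description can be pictured with a small data-structure sketch. The feature names `subject_control`, `object_control` and `raising` come from the text above; the classes `Position` and `Construction`, their fields and the function `has_valid_control` are hypothetical and do not reproduce the Parole DTD.

```python
from dataclasses import dataclass, field
from typing import Optional

CONTROL_FEATURES = {"subject_control", "object_control"}


@dataclass
class Position:
    function: str                # e.g. "subject", "object"
    coref: Optional[int] = None  # shared index linking related positions
    overt: bool = True           # False for a position realized as PRO


@dataclass
class Construction:
    positions: list
    features: set = field(default_factory=set)


def has_valid_control(construction):
    """True if a control feature is present and an overt Controller shares
    its Coref index with a non-overt Controllee."""
    if not (construction.features & CONTROL_FEATURES):
        return False
    overt = {p.coref for p in construction.positions
             if p.overt and p.coref is not None}
    covert = {p.coref for p in construction.positions
              if not p.overt and p.coref is not None}
    return bool(overt & covert)


# Subject control: the main-clause subject and the PRO subject share Coref 1.
control = Construction(
    positions=[Position("subject", coref=1),
               Position("subject", coref=1, overt=False)],
    features={"subject_control"},
)

# Raising: both positions are overt alternatives related inside a frameset.
raising = Construction(
    positions=[Position("subject", coref=1), Position("object", coref=1)],
    features={"raising"},
)

print(has_valid_control(control))  # True
print(has_valid_control(raising))  # False
```

In both cases the Coref index does the linking; what separates control from raising in this sketch is the feature carried by the construction and the presence of a position without overt realization.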

Number of syntactic Units    23218
Number of constructions        475

 

 

3. Bibliography

General:

[Medeiros 1995] José Carlos Medeiros. Processamento Morfológico e Correcção Ortográfica do Português. Tese de Mestrado, Instituto Superior Técnico - Universidade Técnica de Lisboa, Lisboa.

[Nascimento, Rivenc, Cruz 1987] Maria Fernanda B. do Nascimento, Paul Rivenc, M. Luísa Segura da Cruz. Português Fundamental - Métodos e Documentos, Tomo 2, INIC/CLUL, Lisboa.

[Genelex 1994] Project Eureka Genelex - Consortium Genelex. Rapport sur la Couche Morphologique. Version 3.3, 2 November 1994.

 

PAROLE documents:

[Medeiros, Santos, Wittmann 1995] José Carlos Medeiros, Diana Santos, Luzia Wittmann. MLAP PAROLE-WP4-Task 2.2a. "Portuguese Lexicon: Inflectional Morphology. On the compatibility of the Portuguese lexicon Palavroso and the Genelex model at the morphological level".

 

Appendix - state of validation by the AlethGD tool

 

General counting array

Nb of Morphological Units                   20435
Nb of Simple Morphological Units            19782
Nb of Graphical Morphological Units         19936
Number of simple words inflexion modes        378
Number of compound words                      653
Number of compound words inflexion modes      180
Number of syntactic Units                   23218
Number of constructions                       475

 

Full counting array

Morphological layer

dico_Um_S               19782
dico_Um_Aff                 0
dico_Um_Agg                90
dico_Um_C                 653
dico_Umg                19936
dico_Ump                    0
dico_Radg                 863
dico_Radp                   0
dico_Mfg                  378
dico_Mfp                    0
dico_Mfc                  180
dico_CombTM               124
dico_Comb_Comb            182
dico_Etymon                 0
dico_Trait_M              179
dico_Trait_F                0
dico_CatGram               16
dico_SsCatGram             67
dico_Statut                 5
dico_Separg                11
dico_Separp                 7
dico_TypeBref               4

Syntactic layer

dico_Usyn               23218
dico_Description          604
dico_Self                  54
dico_IntervConst           54
dico_ComportAppele          0
dico_Optionnalite           0
dico_Construction         475
dico_Position_C           332
dico_Position_S             0
dico_Insertion              0
dico_Syntagme_T            55
dico_Syntagme_NT_C        169
dico_Syntagme_NT_S          0
dico_MdC                    0
dico_TransfDescription      0
dico_ModifConstruction      0
dico_ModifPosition          0
dico_TransfSyntagme         0
dico_ModifSyntagme_T        0
dico_ModifSyntagme_NT       0
dico_ModifIntervConst       0
dico_Trait_Lex             28
dico_Trait_RefLex           0
dico_Trait_S_Fermes       277
dico_Trait_Aux              0
dico_Trait_Bin              0
dico_Trait_Libre           41
dico_Fonction              37
dico_RoleTh                11
dico_TransfUsyn             0
dico_EtiquetteSynt_T       16
dico_EtiquetteSynt_NT      10
dico_Frame_Set            132

 
