Multext-East
Copernicus project COP 106
This document describes the resources for text segmentation and the lexicons developed in Task 1.2., Language-specific resources. Each partner has developed a set of resource files required by the MULTEXT segmentation tools for their language and a lexicon consisting of the following:
Language    Lemmas   Word forms
Bulgarian    17567       295431
Czech        15703       155649
Estonian     42803       118482
Hungarian     7644        25081
Romanian     38814      1293226
Slovene      10730        26197
All lexicons were developed according to the specifications provided by Task 1.1 and are in the MULTEXT format.
Some of the subtools in the segmenter require certain language-specific information in order to accomplish their tasks. For maximum flexibility and to retain language-independence, all such information is provided directly to the subtools via external resource files. Each partner has developed a set of resource files required by the segmentation tools for their language.
All partners refined the contents of the segmenter resource files for their languages through a step-wise refinement process: the segmenter was tested using Orwell's "1984" as input, the results were sent to the partners and checked, and the resource files were revised on the basis of the output.
All resource files have names in the following form:
tbl.nnnnn.xx
where 'nnnnn' is replaced by a specific file name identifying the contents, and 'xx' is the two-letter ISO standard code [ISO 639:1988] for the language:
bg Bulgarian
cs Czech
et Estonian
hu Hungarian
ro Romanian
sl Slovenian
The following specific resource files were provided by each partner:
- tbl.punct.xx
- The file tbl.punct.xx contains the definition of those characters and character configurations that are to be considered as punctuation. They are defined in a regular expression format and each one is assigned an appropriate class, such as "internal punctuation", "non-breaking punctuation", etc.
- tbl.comppunct.xx
- The file tbl.comppunct.xx contains complex punctuation which includes space characters (such as ". . .").
- tbl.abbrev.xx
- The file tbl.abbrev.xx contains abbreviations ending with a period for the language in question. It is used by the module of the segmenter called mtsegnabbrev. Some abbreviations are identified as belonging to special classes, such as "title", "initial", etc.
- tbl.compabbrev.xx
- The file tbl.compabbrev.xx contains space-composed abbreviations (space replaced by underscore), if they exist.
- tbl.clitics.xx
- The file tbl.clitics.xx contains "clitics" for the language. These are not necessarily true clitics; for our purposes, clitics are regarded as anything separated by a hyphen or an apostrophe. Each is identified as belonging to a class called either PROCLITIC (i.e., proclitic, for items appearing at the beginning of the string) or ENCLITIC (i.e., enclitic, for items appearing at the end of the string). These class names are associated with the identified tokens in the output of the mtsegclitics module of the segmenter.
- tbl.compound.xx
- The file tbl.compound.xx contains multi-word units which need to be re-combined. Orthographic words which are separated by blanks are split into separate tokens by one of the segmenter's subtools which is invoked early in the chain; this file indicates when such words should be regarded as a single token comprising a compound word.
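As an illustration of how such resources drive the tools, the clitic classes described above could be applied along the following lines (a minimal Python sketch; the clitic strings and the in-memory resource shown are invented for the example, not taken from any actual tbl.clitics.xx file):

```python
# Hypothetical resource entries in the spirit of tbl.clitics.xx; the clitic
# strings below are invented for the example, not taken from a real file.
CLITICS = {
    "PROCLITIC": {"n'", "d'"},    # items attached at the beginning of a string
    "ENCLITIC": {"-le", "-lui"},  # items attached at the end of a string
}

def split_clitics(token):
    """Return (string, class) pairs, mimicking the output of mtsegclitics."""
    parts = []
    for cl in CLITICS["PROCLITIC"]:
        if token.startswith(cl):
            parts.append((cl, "PROCLITIC"))
            token = token[len(cl):]
            break
    trailing = []
    for cl in CLITICS["ENCLITIC"]:
        if token.endswith(cl):
            trailing.append((cl, "ENCLITIC"))
            token = token[:-len(cl)]
            break
    parts.append((token, "WORD"))
    return parts + trailing
```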
The MULTEXT-EAST lexicons use the MULTEXT lexical description format, which consists of linear strings of characters representing the morphosyntactic information to be associated with word-forms. Each string is constructed following the philosophy of the Intermediate Format proposed in the EAGLES Corpus proposal (Leech and Wilson, 1994), i.e. of having agreed symbols in predefined and fixed positions. The positions of the string are numbered 0, 1, 2, etc. as follows:
- the agreed character at position 0 encodes part-of-speech;
- each character at position 1, 2, ..., n encodes the value of one attribute (person, gender, number, etc.);
- if an attribute does not apply, the corresponding position in the string contains a special marker, in our case `-' (hyphen).
Example: Ncms- means (noun, common, masculine, singular, no case)
This notation adopts the EAGLES Intermediate Format with a small revision: the Intermediate Format encodes information by means of digits, while in MULTEXT characters of a mnemonic nature are preferred.
It is worth noting here that this representation is proposed for word-form lists which will be used for a specific application, i.e. corpus annotation. We have foreseen these lexical descriptions as containing a full description of lexical items. As noted above, the sets of tags, to be used properly for automatic corpus annotation tools, are expected to contain less information.
These lexical descriptions can be seen as notational variants of a feature-based notation using attribute-value pairs. In fact, the string notation proposed, e.g. Ncms-, is completely synonymous with a feature-structure representation:

+-                     -+
| Cat:    Noun          |
| Type:   common        |
| Gender: masculine     |
| Number: singular      |
+-                     -+

Formal characteristics relevant for our applications have been kept. Using position in the string to encode attributes places no restrictions on the set of characters that can be used as values. If we wanted instead to keep the formal characteristic of an order-independent notation, we would have to make sure that the characters representing attribute values are unambiguous. Because attributes and values are linked by positional criteria, a special marker for void attribute-value pairs is clearly needed to keep the descriptions coherent. Thus, the "Ncms-" style can be viewed as a short-hand notation, convenient for some users and straightforwardly mappable to the information used in unification-based attribute-value formalisms.
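The mapping from the positional notation to the feature-structure view can be sketched as follows (the attribute tables are a small illustrative subset of the lexical specifications, not the full inventory):

```python
# Attribute tables for category N only; a small illustrative subset of the
# full MULTEXT-EAST specifications.
NOUN_ATTRS = ["Type", "Gender", "Number", "Case"]
NOUN_VALUES = {
    "Type":   {"c": "common", "p": "proper"},
    "Gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "Number": {"s": "singular", "p": "plural"},
    "Case":   {},  # case codes are language-specific; '-' means "not applicable"
}
CATEGORIES = {"N": ("Noun", NOUN_ATTRS, NOUN_VALUES)}

def decode_msd(msd):
    """Expand a positional MSD string into a feature structure (a dict)."""
    cat, attrs, values = CATEGORIES[msd[0]]
    fs = {"Cat": cat}
    for pos, attr in enumerate(attrs, start=1):
        code = msd[pos] if pos < len(msd) else "-"
        if code != "-":                  # '-' marks a non-applying attribute
            fs[attr] = values[attr].get(code, code)
    return fs
```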
1. Number of lemmas and word forms
                  Lemmas   Entries
Nouns               9891     47969
  (masculine        4180     25100)
  (feminine         4120     16493)
  (neuter           1591      6376)
Verbs               4140    226666
Adjectives          2155     19397
Pronouns              92       110
Adverbs              790       790
Adpositions           98        98
Conjunctions          76        76
Numerals              67        67
Interjections        172       172
Particles             86        86
Total              17567    295431
2. Means of creation

At the beginning of our work we had two machine-readable dictionaries of the Bulgarian language: (1) the "Bulgarian explanatory dictionary", fourth edition, 1994; (2) the "Orthographic (spelling) dictionary of the Bulgarian language", first edition, 1983. From the first dictionary we extracted the main words with appropriate grammatical information: noun, masculine; verb, transitive, perfect; etc. The entries in the second dictionary contain only the base form of each lemma with an attached set of word forms that characterises the paradigm of the given lemma. This information is enough for a human user to generate the whole paradigm of each lemma.
We constructed a set of rules to make this implicit information explicit. The rules were encoded in a program and run on the merged vocabulary of the two dictionaries. The result is a list of lemmas with enough information for the automatic generation of all word forms. From this list, the lexicon for the Bulgarian part of the MULTEXT-EAST project was extracted.
The Bulgarian MTE lexicon largely covers the available texts (Orwell and so on). We constructed a program that attaches the MTE lexical descriptions to each word form generated from the list of lemmas.
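The paradigm-expansion idea can be illustrated with a toy sketch (the inflection pattern name and suffixes below are invented for the example and do not reproduce the actual Bulgarian rules):

```python
# The paradigm label and suffixes are invented for the example; the real
# rules derive the paradigm from the word forms listed in the dictionary entry.
PARADIGMS = {
    "noun-m1": ["", "a", "ove"],   # hypothetical suffix set for one noun class
}

def expand(lemma, paradigm):
    """Generate all word forms of a lemma from its paradigm label."""
    return [lemma + suffix for suffix in PARADIGMS[paradigm]]
```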
1. Number of lemmas and word forms
                  Lemmas   Wordforms
Nouns               6889       45602
Verbs               3604       16685
Adjectives          4038       91378
Pronouns              33         541
Determiners            0           0   (irrelevant for Czech)
Articles               0           0   (irrelevant for Czech)
Adverbs              748         900
Adpositions           41          47
Conjunctions          38          46
Numerals              19         127
Interjections          0           0
Residual               0           0
Abbreviations        260         290
Particles             33          33
Total              15703      155649
2. Means of creation

Authors of the word form lexicon:
- Jan Hajic (Faculty of Mathematics and Physics, Charles University)
- Vladimir Petkevic (Faculty of Philosophy, Charles University)
The word form lexicon was elaborated as follows. All the word forms from all three MTE corpora were submitted to the Czech morphological analyzer. This analyzer assigns each word form its morphosyntactic information in the form of a string of characters similar to that used in MULTEXT-EAST. Conversion tables between the Czech analyzer format and the MULTEXT-EAST format therefore had to be written, reflecting the correspondences between the two, and a conversion program was then written. In addition, there were important differences and problems to be solved. The Czech analyzer did not properly account for parts of speech such as pronouns, prepositions, and conjunctions, and in some cases the MULTEXT-EAST morphosyntactic information was more specific than that provided by the analyzer; a special table for these parts of speech and for irregulars was therefore written. Moreover, it was not always easy to mark the lemma by '=', because, e.g., for Czech nouns the lemma need not necessarily be masculine singular (normally a default). Finally, some words encountered in the three corpora were not yet included in our internal electronic lexicon for Czech, so the lexicon had to be extended with these words. This was because the corpora chosen for the Czech MTE monolingual corpus were lexically very rich not only in, say, common nouns but also in proper nouns (opera characters in the fiction corpus, for instance).
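The conversion-table step might be sketched as follows (the analyzer tags, table entries and irregulars shown are hypothetical illustrations, not the actual Czech tables):

```python
# Both the analyzer tags and the table entries below are hypothetical
# illustrations; the real correspondence tables are much larger.
CONVERSION_TABLE = {
    "N.MS1": "Ncmsn",   # e.g. masculine noun, singular, nominative
    "N.FS4": "Ncfsa",
}
IRREGULARS = {"se": "Px---"}  # special table for problematic items

def convert(wordform, analyzer_tag):
    """Map one analyzer reading onto a MULTEXT-EAST-style MSD."""
    if wordform in IRREGULARS:
        return IRREGULARS[wordform]
    return CONVERSION_TABLE[analyzer_tag]
```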
After the word form lexicon was eventually developed and processed by a sequence of conversion programs, it was validated with Tomaž Erjavec's validating tools (mtems-expand, lexmsd), which are available to the MTE partners at the Slovene web site (Tomaž Erjavec's support was much appreciated by the Czech partner). Some errors and bugs were detected by the validator, but after approximately three test cycles the lexicon was fully conformant with the latest version of the lexical specifications.
The lexicon contains more than the required 15,000 lemmas (see above). However, the three Czech MTE corpora processed contain more lemmas and word forms than the delivered lexicon, so a selection of the word forms and lemmas to be included in the word form lexicon had to be made. The lexicon can be extended by automatic and semi-automatic means (new words had, and have, to be included in the internal lexicon for Czech).
Given the extreme richness of Czech morphology, which is often very intricate and irregular (the most complicated in the family of Slavic languages), the Czech partner considers the preparation of the word form lexicon the most difficult task within the MTE project so far. As a result, the task could be accomplished only hours before the prescribed deadline.
Notes:
(a) No strings belonging to the class Residual have been included in the word form lexicon, but they can be added almost without problems.
(b) The lexicon is being continually extended.
1. Number of lemmas and word forms
                    Lemmas   Wordforms
Nouns (N)            29816       73228
Verbs (V)             3133       18253
Adjectives (A)       10949       22542
Numerals (M)           137         561
Pronouns (P)            60         926
Adpositions (S)        177         177
Conjunctions (C)        23          23
Interjections (I)       94          94
Adverbs (R)           2678        2678
Together             42803      118482

The lemmas were counted using a series of scripts like the one in Appendix 1.
2. Means of creation

The lexicon was created semi-automatically. The basis for creating it was a corpus of 450,000 words of Estonian written texts from 1985, included in the Base Corpus of Estonian Literary Language created at the University of Tartu. The corpus used for creating the lexicon contains 150 kW of newspapers, 150 kW of fiction (both of which contain the 100 kW MTE corpora) and 150 kW of science.
First, a frequency list of the wordforms was made. Then non-words (numbers, one-letter words, acronyms and abbreviations) were deleted from the list, and the list was run through a morphology analyser from the company Filosoft. The output was then transformed to conform to the MTE specifications; the transformation was done semi-automatically, using some UNIX scripts specifically written for this task.
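The first two steps of this pipeline can be sketched as follows (the filtering criteria are a simplified reading of the text above, not the original scripts):

```python
from collections import Counter

def frequency_list(tokens):
    """Count wordforms, then drop the 'non-words' mentioned above: numbers,
    one-letter words, and period-final abbreviations.  The criteria here are
    a simplified reading of the text, not the original UNIX scripts."""
    counts = Counter(token.lower() for token in tokens)
    def is_word(w):
        return len(w) > 1 and not w[0].isdigit() and not w.endswith(".")
    return {w: n for w, n in counts.items() if is_word(w)}
```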
This was thought to be better than generating all the possible wordforms while omitting compounds and derivations. The way the lexicon was created in effect imitates a situation where no morphology analyser is available, but people able to link wordforms with lemmas are.
As a result, the lexicon:
- contains 43kW of lemmas and 118kW of wordforms
- should cover 95% of the wordforms met in '1984', MTE fiction and newspaper corpora
- contains compound words and derivations which normally would not be a part of the lexicon
- does not represent the complete paradigms of all the lemmas
Bugs and delicate matters:
- For some reason, there are about 30 duplicated entries in the lexicon. These should be deleted.
- The lexicon does not contain compound words with hyphens. Some of these words can be analyzed as separate simplex words, but some cannot.
- The lexicon does not contain adjectives which are homonymous with geographical names, differing from the latter only in beginning with a small letter, e.g. Aasia (English: Asia) vs. aasia (English: Asian).
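The duplicated entries mentioned in the first point could be removed with a straightforward order-preserving filter, assuming one lexicon entry per line:

```python
def deduplicate(lines):
    """Drop exact duplicate lexicon lines, keeping the first occurrence."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique
```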
1. Number of lemmas and word forms

The text contained 81,167 words, of which 25,081 were different forms.
                    Lemmas   Wordforms
Nouns (N)             2526       10591
Verbs (V)              664        7069
Adjectives (A)        3228        5508
Numerals (M)           117         294
Pronouns (P)            78         483
Adpositions (S)         55         106
Conjunctions (C)        50          50
Interjections (I)       23          23
Adverbs (R)            863         880
Articles (T)             1           4
Abbreviations (Y)        1           5
Residuals (X)           38          68
Together              7644       25081
2. Means of creation

Since sufficient morphological analysis of Hungarian cannot be achieved with list-based dictionaries, we had to follow a different path while trying to keep maximal compatibility with the MULTEXT policy.
For this we made two important steps.
In the first, we eliminated the derivative and compound forms. This process is very straightforward in Hungarian: we took the constituent morphemes of the derivative or compound form, created a new word by merging these constituents into one, and gave it the word class of its rightmost constituent. This way we reduced the number of segmentation sequences to an acceptable limit which could be described within the framework of the MULTEXT-EAST lexical specifications. We thus reached compatibility with the MULTEXT lexical specifications by reducing segmented constituents to Stem+Suffix combinations only.
But the number of different Stem+Suffix combinations is still too high to list them in a dictionary. Importantly, by reducing prefix+stem+derivation sequences to new stems we created an enormous number of new elements which now have to be listed in the dictionary. For the exact numbers see Appendix 2, The Number of Hungarian Word Forms:
... from an average verb we can create 540 other verbs and 2160 nominal forms. With all of its suffixes a verb can have more than 2 million forms ...

To reach as much compatibility as possible with the morphological process, we decided on the following: we extended the analysis of Hungarian with another step.
In this step, a Hungarian morphological analyzer (HUMOR, delivered by the subcontractor) processes the text, and a tool then converts the results into a MULTEXT dictionary. Afterwards, the Multext morphological analyzer can run using the dictionary created in the previous step, dedicated specifically to the current text. If we create the appropriate dictionary and supply it with the text (e.g. for "1984"), then the analysis can be carried out at sites where the additional tool is not available or where extending the process is unwanted.
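The conversion of analyzer output into a text-specific MULTEXT dictionary can be sketched as follows (the triple format and the tags in the example are assumptions, not the actual HUMOR output):

```python
def build_text_dictionary(analyses):
    """analyses: iterable of (wordform, lemma, msd) triples, as assumed here
    for the converted analyzer output.  Duplicates are collapsed and the
    entries are emitted in a tab-separated lexicon layout."""
    return ["\t".join(entry) for entry in sorted(set(analyses))]
```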
- Dan Tufiş
  Center for Research in Machine Learning, Natural Language Processing and Conceptual Modelling - RACAI, Bucharest (contractor)
- Lidia Diaconu, Călin Diaconu, Ana-Maria Barbu, Camelia Popescu, Ileana Letinu
Research Institute for Informatics - ICI, Bucharest (subcontractor)
There are two dictionaries for Romanian. The first is the proper dictionary, containing words likely to be found in usual written texts. The second is the dictionary of invented words (Newspeak) appearing in Orwell's book "1984", which is the text used for building the multilingual corpus. In the statistics provided below, the Newspeak "words" are shown as additional (+ k). For instance, the line:

                    Lemmas      Wordforms
Adjectives (A)      11078+24    347128+128

is to be interpreted as follows: there are 11078 lemmas for normal Romanian adjectives, which produced 347128 inflected forms, plus 24 Newspeak Romanian adjectives, which produced a further 128 wordforms.
1. Number of lemmas and word forms
The statistical data on the Romanian lexicon (including the newspeak words) is given below:
                    Lemmas      Wordforms
Adjectives (A)      11078+24    347128+128
Conjunctions (C)       59          121
Determiners (D)        60          890
Interjections (I)     182+3        187+3
Numerals (M)           98         1622
Nouns (N)           17398+36    322284+142
Pronouns (P)           80+1        1630+2
Particles (Q)           6           14
Adverbs (R)           688+2        1247+6
Adpositions (S)        57+2          89+4
Articles (T)           12           79
Verbs (V)            9084+3     617921+12
Residuals (X)          12           14
Total               38814+71   1293226+297

2. Means of creation
The Multext-East lexicon has been created by means of a unification-based linguistic processing environment (mac-ELU), developed by us in cooperation with ISSCO-Geneva. We developed a large unification-based lexicon (about 20,000 entries).
In Appendix 4, we provide a (partial) mac-ELU description of the Romanian dictionary encoding.
1. Number of lemmas and word forms
Category             Entrs    WFSs     Lms      =   MSDs
Nouns (N):           24205   10384    4407   2747     88
Verbs (V):           15337    9285    3053    832    114
Adjectives (A):      40682    7930    3176   1039    245
Pronouns (P):         1991     438      78     59    976
Adverbs (R):          1818    1807    1342    360      3
Adpositions (S):       120     106      44     77      6
Conjunctions (C):       37      37       1     37      2
Numerals (M):         1359     195       6      1    139
Interjections (I):       6       6       1      6      1
Residuals (X):           0       0       0      0      0
Abbreviations (Y):      17      17       1     17      1
Particles (Q):          61      61       1     61      1
TOTAL (*):           85633   26197   10730   5236   1576

Entrs: Number of entries
WFSs: Number of distinct word-forms
Lms: Number of distinct lemmas
= : Number of '=' lemmas
MSDs: Number of distinct morphosyntactic descriptions
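For illustration, figures like those in the table could be computed from a lexicon given as (wordform, lemma, MSD) triples, with '=' marking a lemma identical to the wordform (a sketch under that assumed entry format):

```python
def category_stats(entries, pos):
    """entries: (wordform, lemma, msd) triples; pos: category letter.
    Computes the five figures used in the table above; the treatment of
    '=' lemmas follows our reading of the legend."""
    rows = [(w, l, m) for (w, l, m) in entries if m.startswith(pos)]
    return {
        "Entrs": len(rows),
        "WFSs":  len({w for w, _, _ in rows}),
        "Lms":   len({w if l == "=" else l for w, l, _ in rows}),
        "=":     len({w for w, l, _ in rows if l == "="}),
        "MSDs":  len({m for _, _, m in rows}),
    }
```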
2. Means of creation
The lexicon is produced for the Ljubljana partner by the subcontractor Amebis d.o.o. from their proprietary lexical database.
The lexicon was created automatically by expanding the Amebis database entries and mapping their descriptions into the MULTEXT-EAST V2 morphosyntactic descriptions.
The Slovene word-form lexicon currently covers the non-idiosyncratic word-forms appearing in the Slovene "1984" and Fiction corpora. Thus Newspeak words and uncommon proper names are not included in the lexicon; furthermore, only word-forms actually appearing in the corpora are included, and not the complete paradigms of the lexicon lemmas.
Copyright © Centre National de la Recherche Scientifique, 1996.