Multext-East - Deliverable D1.2. Language-specific resources. May 96.








Multext-East
Language-specific resources

Copernicus project COP 106
Deliverable D1.2 - May 1996








Credits

Workpackage Coordinator:
Nancy Ide (ide@cs.vassar.edu)
Contributors:
Bulgarian: L. Dimitrova, L. Sinapova, K. Simov, D. Popov, Sv. Manova-Vidinska
Czech: V.Petkevic, J.Klímová and V.Schmiedtová
Estonian: H.J.Kaalep, E.Toomsalu
Hungarian: C.Oravecz and L.Tihanyi
Romanian: S.Bruda, C.Diaconu, L.Diaconu, and D.Tufis
Slovene: T.Erjavec, P.Holozan and M.Romih



Contents



A. Introduction

This document describes the resources for text segmentation and the lexicons developed in Task 1.2, Language-specific resources. Each partner has developed a set of resource files required by the MULTEXT segmentation tools for their language, together with a lexicon of the following size:


Language     Lemmas    Word forms
Bulgarian     17567        295431
Czech         15703        155649
Estonian      42803        118482
Hungarian      7644         25081
Romanian      38814       1293226
Slovene       10730         26197


All lexicons were developed according to the specifications provided by Task 1.1 and are in the MULTEXT format.



B. Segmenter resources

Some of the subtools in the segmenter require certain language-specific information in order to accomplish their tasks. For maximum flexibility and to retain language-independence, all such information is provided directly to the subtools via external resource files. Each partner has developed a set of resource files required by the segmentation tools for their language.

All partners refined the contents of the segmenter resource files for their languages through a process of step-wise refinement: the segmenter was tested using Orwell's "1984" as input, the results were sent to the partners and checked, and revisions to the resource files were made on the basis of the output.

All resource files have names in the following form:

tbl.nnnnn.xx

where 'nnnnn' is replaced by a specific file name identifying the contents, and 'xx' is the two-letter ISO standard code [ISO 639:1988] for the language:
bg  Bulgarian
cs  Czech
et  Estonian
hu  Hungarian
ro  Romanian
sl  Slovenian

The following specific resource files were provided by each partner:

tbl.punct.xx
The file tbl.punct.xx contains the definition of those characters and character configurations that are to be considered as punctuation. They are defined in a regular expression format and each one is assigned an appropriate class, such as "internal punctuation", "non-breaking punctuation", etc.
tbl.comppunct.xx
The file tbl.comppunct.xx contains complex punctuation which includes space characters (such as ". . .").
tbl.abbrev.xx
The file tbl.abbrev.xx contains abbreviations ending with a period for the language in question. It is used by the module of the segmenter called mtsegnabbrev. Some abbreviations are identified as belonging to special classes, such as "title", "initial", etc.
tbl.compabbrev.xx
The file tbl.compabbrev.xx contains space-composed abbreviations (space replaced by underscore), if they exist.
tbl.clitics.xx
The file tbl.clitics.xx contains "clitics" for the language. These are not necessarily true clitics; for our purposes, clitics are regarded as anything separated by a hyphen or an apostrophe. Each is assigned to one of two classes: PROCLITIC (for items appearing at the beginning of the string) or ENCLITIC (for items appearing at the end of the string). These class names are associated with the identified tokens in the output of the mtsegclitics module of the segmenter.
tbl.compound.xx
The file tbl.compound.xx contains multi-word units which need to be re-combined. Orthographic words which are separated by blanks are split into separate tokens by one of the segmenter's subtools which is invoked early in the chain; this file indicates when such words should be regarded as a single token comprising a compound word.
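To illustrate how the clitics table described above is applied, the following sketch mimics the behaviour of the mtsegclitics module. The clitic entries shown are invented examples; the real module is driven by the contents of tbl.clitics.xx rather than hard-coded sets.

```python
# Invented example entries; the real lists come from tbl.clitics.xx.
PROCLITICS = {"l'", "d'"}        # attach at the beginning of the string
ENCLITICS = {"-t-il", "-elle"}   # attach at the end of the string

def split_clitics(token):
    """Split a token into (piece, class) pairs, where class is
    'PROCLITIC', 'ENCLITIC', or None for the remaining material."""
    for clitic in PROCLITICS:
        if token.startswith(clitic) and len(token) > len(clitic):
            return [(clitic, "PROCLITIC"), (token[len(clitic):], None)]
    for clitic in ENCLITICS:
        if token.endswith(clitic) and len(token) > len(clitic):
            return [(token[:-len(clitic)], None), (clitic, "ENCLITIC")]
    return [(token, None)]
```

A token that matches no table entry is passed through unchanged, which is the behaviour expected of the segmenter for ordinary words.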



C. Lexicons


C.1. Format and notation

The MULTEXT-EAST lexicons use the MULTEXT lexical description format, which consists of linear strings of characters representing the morphosyntactic information to be associated with word-forms. The string is constructed following the philosophy of the Intermediate Format proposed in the EAGLES Corpus proposal (Leech and Wilson, 1994), i.e. of having agreed symbols in predefined and fixed positions: the positions in a string of characters are numbered 0, 1, 2, etc., with position 0 encoding the part of speech and each subsequent position encoding the value of one attribute:

Example: Ncms- means (noun,common,masculine,singular,nocase)

This notation adopts the EAGLES Intermediate Format with a small revision: the Intermediate Format encodes information by means of digits, while in MULTEXT characters of a mnemonic nature are preferred.

It is worth noting here that this representation is intended for word-form lists to be used in a specific application, i.e. corpus annotation. We have designed these lexical descriptions to contain a full description of lexical items; as noted above, the tagsets to be used by automatic corpus annotation tools are expected to contain less information.

These lexical descriptions can be seen as notational variants of the feature-based notation in the form of attribute-value pairs. In fact, the proposed string notation, e.g. Ncms-, is completely equivalent to a feature-structure representation:

+-                   -+
| Cat:    Noun        |
| Type:   common      |
| Gender: masculine   |
| Number: singular    |
+-                   -+
Formal characteristics relevant for our applications have been kept. Because position in the string encodes the attribute, there is no restriction on the set of characters used as values. Conversely, if we wanted to keep the formal characteristic of an order-independent notation, we would have to ensure that the characters representing attribute values are unambiguous. Since attributes and values are linked by positional criteria, a special marker for void attribute-value pairs is clearly needed to keep descriptions coherent. Thus, the "Ncms-" style can be viewed as a shorthand notation convenient for some users and straightforwardly mappable to the information used in unification-based attribute-value formalisms.
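The mapping between the positional strings and feature structures can be sketched as follows. The value tables cover only the noun attributes needed for the Ncms- example above, not the full MULTEXT-EAST specification.

```python
# Attribute tables for nouns only, restricted to the values needed for the
# example; the full specification defines many more categories and values.
NOUN_ATTRS = ("Type", "Gender", "Number", "Case")
NOUN_VALUES = {
    "Type": {"c": "common", "p": "proper"},
    "Gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "Number": {"s": "singular", "p": "plural"},
    "Case": {},
}

def decode_msd(msd):
    """Decode a noun MSD string such as 'Ncms-' into attribute-value pairs."""
    if msd[0] != "N":
        raise ValueError("this sketch handles nouns only")
    features = {"Cat": "Noun"}
    for attr, code in zip(NOUN_ATTRS, msd[1:]):
        if code == "-":          # '-' marks a void attribute-value pair
            continue
        features[attr] = NOUN_VALUES[attr][code]
    return features
```

Applied to "Ncms-", the function yields exactly the feature structure shown above, with the void Case attribute omitted.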

C.2. Bulgarian lexicon

1. Number of lemmas and word forms


                Lemmas   Entries
Nouns             9891     47969
 (masculine       4180     25100)
 (feminine        4120     16493)
 (neuter          1591      6376)
Verbs             4140    226666
Adjectives        2155     19397
Pronouns            92       110
Adverbs            790       790
Adpositions         98        98
Conjunctions        76        76
Numerals            67        67
Interjections      172       172
Particles           86        86
Total            17567    295431


2. Means of creation

At the beginning of our work we had two machine-readable dictionaries of the Bulgarian language: (1) the "Bulgarian explanatory dictionary", fourth edition, 1994; (2) the "Orthographic (spelling) dictionary of the Bulgarian language", first edition, 1983. From the first dictionary we extracted the main words with appropriate grammatical information (noun, masculine; verb, transitive, perfect; etc.). The entries in the second dictionary contain only the base form of each lemma with an attached set of word forms that characterise the paradigm of the given lemma. This information is sufficient for human users to generate the whole paradigm of each lemma.

We constructed a set of rules to make this implicit information explicit. The rules were encoded in a program and run on the merged vocabulary of the two dictionaries. The result is a list of lemmas with enough information for automatic generation of all word forms. From this list, the lexicon for the Bulgarian part of the MULTEXT-EAST project was extracted.

The Bulgarian MTE lexicon largely covers the available texts (Orwell and so on). We constructed a program that attaches the MTE lexical descriptions to each word form generated from the list of lemmas.
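The rule-based paradigm expansion described above can be sketched schematically. The inflection class and endings below are invented placeholders, not the actual Bulgarian rules, which are far richer.

```python
# An invented inflection-class table: each class lists the endings attached to
# the lemma to generate its word forms. Placeholder data for illustration only.
PARADIGMS = {
    "n1": ["", "i", "ite"],   # invented endings, not real Bulgarian ones
}

def expand(lemma, paradigm_class):
    """Generate the full set of word forms for a lemma from its class."""
    return [lemma + ending for ending in PARADIGMS[paradigm_class]]
```

Running such an expansion over the whole lemma list yields the word-form list to which the MTE lexical descriptions are then attached.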

C.3. Czech lexicon

1. Number of lemmas and word forms


                 Lemmas   Wordforms
Nouns:             6889       45602
Verbs:             3604       16685
Adjectives:        4038       91378
Pronouns:            33         541
Determiners:          0           0  (irrelevant for Czech)
Articles:             0           0  (irrelevant for Czech)
Adverbs:            748         900
Adpositions:         41          47
Conjunctions:        38          46
Numerals:            19         127
Interjections:        0           0
Residual:             0           0
Abbreviations:      260         290
Particles:           33          33
Total:            15703      155649


2. Means of creation

Authors of the word form lexicon:

The word form lexicon was elaborated as follows. All the word forms from all three MTE corpora were submitted to the Czech morphological analyzer. This analyzer assigns each word form its morphosyntactic information in the form of a string of characters similar to that used in MULTEXT-EAST. Thus, conversion tables between the Czech analyzer format and that of MULTEXT-EAST had to be written, reflecting the correspondences between the two formats, and a conversion program was then written. In addition, there were important differences and problems to be solved. The Czech analyzer did not properly account for such parts of speech as pronouns, prepositions and conjunctions, and in some cases the MULTEXT-EAST morphosyntactic information was more specific than that provided by the Czech morphological analyzer. A special table for these parts of speech and for irregular words was therefore written. Moreover, it was not always easy to mark the lemma by '=' because, e.g., for nouns the lemma in Czech need not be masculine singular (normally a default). Finally, some words encountered in the three corpora were not yet included in our internal electronic lexicon for Czech, so the lexicon had to be extended with these words. This was because the corpora chosen for the Czech MTE monolingual corpus were lexically very rich not only in, say, common nouns but also in proper nouns (opera characters in the fiction corpus, for instance).

After the word form lexicon was developed and processed by a sequence of conversion programs, it was validated by Tomaz Erjavec's validating tools (mtems-expand, lexmsd), which are available to the MTE partners at the Slovene www site (Tomaz Erjavec's support was much appreciated by the Czech partner). Some errors and bugs were detected by the validator, but after approximately three test cycles the lexicon was fully conformant with the latest version of the lexical specifications.

The lexicon contains more than the required 15,000 lemmas (see above). However, the three Czech MTE corpora contain more lemmas and word forms than the lexicon that was delivered, so a selection of the word forms and lemmas to be included in the WF lexicon had to be made. The lexicon can be extended by automatic and semi-automatic means (new words had, and still have, to be included in the internal lexicon for Czech).

Bearing in mind the extreme richness of Czech morphology, which is often very intricate and irregular (the most complicated in the family of Slavic languages), the Czech partner considers the preparation of the word form lexicon to be the most difficult task within the MTE project so far. As a result, the task could be accomplished only hours before the prescribed deadline.

Notes:

(a) No strings belonging to the class Residual have been included in the WF lexicon, but they can be added almost without problems.

(b) The lexicon is being constantly extended.


C.4. Estonian lexicon

1. Number of lemmas and word forms


                   Lemmas   Wordforms
Nouns N             29816       73228
Verbs V              3133       18253
Adjectives A        10949       22542
Numerals M            137         561
Pronouns P             60         926
Adpositions S         177         177
Conjunctions C         23          23
Interjections I        94          94
Adverbs R            2678        2678
Together            42803      118482

The lemmas were counted using a series of scripts like the one in Appendix 1.


2. Means of creation

The lexicon was created semi-automatically. The basis for creating it was a corpus of 450,000 words from Estonian written texts from 1985 that are included in the Base Corpus of Estonian Literary Language, created at the University of Tartu. The corpus for creating the lexicon contains 150 kW of newspapers, 150 kW of fiction (both of which contain the 100 kW MTE corpora) and 150 kW of science.

First, a frequency list of the wordforms was made. Then non-words (numbers, one-letter words, acronyms and abbreviations) were deleted from the list, and it was run through a morphological analyser from the company "Filosoft". The output was then transformed to conform to the MTE specifications. The transformation was done semi-automatically, using some UNIX scripts written specifically for this task.
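The first two steps of this pipeline can be sketched as follows; the non-word heuristics here are illustrative stand-ins for the actual UNIX scripts.

```python
from collections import Counter

def frequency_list(text):
    """Build a wordform frequency list, dropping non-words (numbers,
    one-letter words, and all-capital acronyms), as described above."""
    counts = Counter(text.split())
    def is_word(form):
        return len(form) > 1 and not form.isdigit() and not form.isupper()
    return {form: n for form, n in counts.items() if is_word(form)}
```

The surviving wordforms would then be passed to the morphological analyser and the results converted to the MTE format.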

This was thought to be better than generating all possible wordforms, which would have omitted compounds and derivations. The way the lexicon was created in effect imitates a situation where no morphological analyser is available, but some people are able to link wordforms with lemmas.

As a result, the lexicon:

Bugs and delicate matters:

  1. For some reason, there are about 30 duplicated entries in the lexicon. These should be deleted.
  2. The lexicon does not contain compound words with hyphens. Some of these words can be analyzed as separate simplex words, but some cannot.
  3. The lexicon does not contain adjectives which are homonymous with geographical names, differing from the latter only in that they begin with a small letter, e.g. Aasia (Engl. Asia) -- aasia (Engl. Asian).


C.5. Hungarian lexicon

1. Number of lemmas and word forms

The text contained 81,167 words, of which 25,081 were distinct forms.


                   Lemmas   Wordforms
Nouns N              2526       10591
Verbs V               664        7069
Adjectives A         3228        5508
Numerals M            117         294
Pronouns P             78         483
Adpositions S          55         106
Conjunctions C         50          50
Interjections I        23          23
Adverbs R             863         880
Article T               1           4
Abbreviations Y         1           5
Residuals X            38          68
Together             7644       25081


2. Means of creation

Since sufficient morphological analysis of Hungarian cannot be achieved with list-based dictionaries, we had to follow a different approach, while trying to keep maximal compatibility with the MULTEXT policy.

To this end, we took two important steps.

In the first step, we eliminated derivative and compound forms. This process is very straightforward in Hungarian. We took the constituent morphemes of the derivative or compound form, merged these constituents into one new word, and gave it the word class of the rightmost constituent. This way we reduced the number of segmentation sequences to an acceptable limit which could be described in the framework of the MULTEXT-EAST lexical specification. We thus achieved compatibility with the MULTEXT lexical specifications by reducing segmented constituents to Stem+Suffix combinations only.
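The merging step can be sketched as follows; the (form, word class) input format is an invented representation for illustration.

```python
def merge_constituents(morphemes):
    """Merge the constituent morphemes of a derivative or compound form into
    one new word, giving it the word class of the rightmost constituent."""
    stem = "".join(form for form, _ in morphemes)
    word_class = morphemes[-1][1]
    return stem, word_class
```

For example, a compound of two nouns is reduced to a single new noun stem, which can then take ordinary suffixation.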

But the number of different Stem+Suffix combinations is still too high to list them in a dictionary. Importantly, by reducing prefix+stem+derivation sequences to new stems we created an enormous number of new elements which would now have to be listed in the dictionary. For the exact numbers see Appendix 2, The Number of Hungarian Word Forms:

... from an average verb we can create 540 other verbs and 2160 nominal forms. With all of its suffixes a verb can have more than 2 million forms...

To reach as much compatibility as possible with the morphological process, we decided on the following: we extended the analysis of Hungarian with another step.

In this step a Hungarian morphological analyzer (HUMOR, delivered by the subcontractor) processes the text, and a tool then converts the results to a MULTEXT dictionary. The MULTEXT morphological analyzer can then be run using the dictionary created in the previous step, which is dedicated specifically to the current text. If we create the appropriate dictionary and supply it with the text (e.g. for "1984"), the analysis can be carried out at sites where the additional tool is not available or where extending the process is unwanted.


C.6. Romanian lexicon


There are two dictionaries for Romanian. The first is the dictionary proper, containing words likely to be found in ordinary written texts. The second is the dictionary of invented words (Newspeak) appearing in Orwell's book "1984", which is the text used for building the multilingual corpus. In the statistics provided below, the Newspeak "words" are shown as additional (+k). For instance, the lines:

                  Lemmas      Wordforms
Adjectives (A)    11078+24    347128+128

are to be interpreted as follows: there are 11078 lemmas for normal Romanian adjectives, which produced 347128 inflected forms, plus 24 Newspeak Romanian adjectives, which produced another 128 wordforms.

1. Number of lemmas and word forms

The statistical data on the Romanian lexicon (including the newspeak words) is given below:


                    Lemmas      Wordforms
Adjectives (A)      11078+24    347128+128
Conjunctions (C)    59          121
Determiners (D)     60          890
Interjections (I)   182+3       187+3
Numerals (M)        98          1622
Nouns (N)           17398+36    322284+142
Pronouns (P)        80+1        1630+2
Particles (Q)       6           14
Adverbs (R)         688+2       1247+6
Adpositions (S)     57+2        89+4
Articles (T)        12          79
Verbs (V)           9084+3      617921+12
Residuals (X)       12          14
Total               38814+71    1293226+297

2. Means of creation

The Multext-East lexicon has been created by means of a unification-based linguistic processing environment (mac-ELU), which was developed by us in cooperation with ISSCO-Geneva. We developed a large unification-based lexicon (about 20,000 entries).

In Appendix 4 we provide a partial mac-ELU description of the Romanian dictionary encoding.


C.7. Slovenian lexicon

1. Number of lemmas and word forms


Category             Entrs    WFSs     Lms       =    MSDs
Nouns (N):           24205   10384    4407    2747      88
Verbs (V):           15337    9285    3053     832     114
Adjectives (A):      40682    7930    3176    1039     245
Pronouns (P):         1991     438      78      59     976
Adverbs (R):          1818    1807    1342     360       3
Adpositions (S):       120     106      44      77       6
Conjunctions (C):       37      37       1      37       2
Numerals (M):         1359     195      61       1      39
Interjections (I):       6       6       1       6       1
Residuals (X):           0       0       0       0       0
Abbreviations (Y):      17      17       1      17       1
Particles (Q):          61      61       1      61       1
TOTAL (*):           85633   26197   10730    5236    1576

Entrs: Number of entries
WFSs: Number of distinct word-forms
Lms: Number of distinct lemmas
= : Number of '=' lemmas
MSDs: Number of distinct morphosyntactic descriptions

2. Means of creation

The lexicon was produced for the Ljubljana partner by the subcontractor Amebis d.o.o. from their proprietary lexical database.

The lexicon was created automatically by expanding the Amebis database entries and mapping their descriptions into the MULTEXT-EAST V2 morphosyntactic descriptions.

The Slovene word-form lexicon currently covers the non-idiosyncratic word-forms appearing in the Slovene "1984" and fiction corpora. Thus Newspeak words and uncommon proper names are not included in the lexicon; furthermore, only word-forms actually appearing in the corpus are included, and not the complete paradigms of the lexicon lemmas.




Copyright © Centre National de la Recherche Scientifique, 1996.