Multext-East
Copernicus project COP 106
This document describes the resources for text segmentation and the lexicons developed in Task 1.2., Language-specific resources. Each partner has developed a set of resource files required by the MULTEXT segmentation tools for their language and a lexicon consisting of the following:
Language    Lemmas   Word forms
Bulgarian    17567       295431
Czech        15703       155649
Estonian     42803       118482
Hungarian     7644        25081
Romanian     38814      1293226
Slovene      10730        26197
All lexicons were developed according to the specifications provided by Task 1.1 and are in the MULTEXT format.
Some of the subtools in the segmenter require certain language-specific information in order to accomplish their tasks. For maximum flexibility and to retain language-independence, all such information is provided directly to the subtools via external resource files. Each partner has developed a set of resource files required by the segmentation tools for their language.
All partners refined the contents of the segmenter resource files for their languages through a step-wise refinement process: the segmenter was tested using Orwell's "1984" as input, the results were sent to the partners and checked, and the resource files were revised on the basis of the output.
All resource files have names in the following form:
tbl.nnnnn.xx
where 'nnnnn' is replaced by a specific file name identifying the contents, and 'xx' is the two-letter ISO standard code [ISO 639:1988] for the language:
bg Bulgarian
cs Czech
et Estonian
hu Hungarian
ro Romanian
sl Slovenian
The following specific resource files were provided by each partner:
- tbl.punct.xx
- The file tbl.punct.xx contains the definition of those characters and character configurations that are to be considered as punctuation. They are defined in a regular expression format and each one is assigned an appropriate class, such as "internal punctuation", "non-breaking punctuation", etc.
- tbl.comppunct.xx
- The file tbl.comppunct.xx contains complex punctuation which includes space characters (such as ". . .").
- tbl.abbrev.xx
- The file tbl.abbrev.xx contains abbreviations ending with a period for the language in question. It is used by the module of the segmenter called mtsegnabbrev. Some abbreviations are identified as belonging to special classes, such as "title", "initial", etc.
- tbl.compabbrev.xx
- The file tbl.compabbrev.xx contains space-composed abbreviations (space replaced by underscore), if they exist.
- tbl.clitics.xx
- The file tbl.clitics.xx contains "clitics" for the language. These are not necessarily true clitics; for our purposes, clitics are regarded as anything separated by a hyphen or an apostrophe. Each is identified as belonging to a class called either PROCLITIC (i.e., proclitic, for items appearing at the beginning of the string) or ENCLITIC (i.e., enclitic, for items appearing at the end of the string). These class names are associated with the identified tokens in the output of the mtsegclitics module of the segmenter.
- tbl.compound.xx
- The file tbl.compound.xx contains multi-word units which need to be re-combined. Orthographic words which are separated by blanks are split into separate tokens by one of the segmenter's subtools which is invoked early in the chain; this file indicates when such words should be regarded as a single token comprising a compound word.
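As an illustration of how such resources drive the tools, the clitic classes described above could be applied along the following lines (a minimal Python sketch; the clitic strings and the in-memory resource shown are invented for the example, not taken from any actual tbl.clitics.xx file):

```python
# Hypothetical resource entries in the spirit of tbl.clitics.xx; the clitic
# strings below are invented for the example, not taken from a real file.
CLITICS = {
    "PROCLITIC": {"n'", "d'"},    # items attached at the beginning of a string
    "ENCLITIC": {"-le", "-lui"},  # items attached at the end of a string
}

def split_clitics(token):
    """Return (string, class) pairs, mimicking the output of mtsegclitics."""
    parts = []
    for cl in CLITICS["PROCLITIC"]:
        if token.startswith(cl):
            parts.append((cl, "PROCLITIC"))
            token = token[len(cl):]
            break
    trailing = []
    for cl in CLITICS["ENCLITIC"]:
        if token.endswith(cl):
            trailing.append((cl, "ENCLITIC"))
            token = token[:-len(cl)]
            break
    parts.append((token, "WORD"))
    return parts + trailing
```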
The MULTEXT-EAST lexicons use the MULTEXT lexical description format, which consists of linear strings of characters representing the morphosyntactic information to be associated with word-forms. Each string is constructed following the philosophy of the Intermediate Format proposed in the EAGLES Corpus proposal (Leech and Wilson, 1994), i.e. of having agreed symbols in predefined and fixed positions. The positions of the string are numbered 0, 1, 2, etc. as follows:
- the agreed character at position 0 encodes part-of-speech;
- each character at position 1, 2, ..., n encodes the value of one attribute (person, gender, number, etc.);
- if an attribute does not apply, the corresponding position in the string contains a special marker, in our case `-' (hyphen).
Example: Ncms- means (noun, common, masculine, singular, no case)
This notation adopts the EAGLES Intermediate Format with a small revision: the Intermediate Format encodes information by means of digits, while in MULTEXT characters of a mnemonic nature are preferred.
It is worth noting here that this representation is proposed for word-form lists which will be used for a specific application, i.e. corpus annotation. We have foreseen these lexical descriptions as containing a full description of lexical items. As noted above, the sets of tags, to be used properly for automatic corpus annotation tools, are expected to contain less information.
These lexical descriptions can be seen as notational variants of a feature-based notation using attribute-value pairs. In fact, the string notation proposed, e.g. Ncms-, is completely synonymous with a feature-structure representation:

+-                     -+
| Cat:    Noun          |
| Type:   common        |
| Gender: masculine     |
| Number: singular      |
+-                     -+

Formal characteristics relevant for our applications have been kept. Using position in the string to encode attributes places no restrictions on the set of characters that can be used as values. If we wanted instead to keep the formal characteristic of an order-independent notation, we would have to make sure that the characters representing attribute values are unambiguous. Because attributes and values are linked by positional criteria, a special marker for void attribute-value pairs is clearly needed to keep the descriptions coherent. Thus, the "Ncms-" style can be viewed as a short-hand notation, convenient for some users and straightforwardly mappable to the information used in unification-based attribute-value formalisms.
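The mapping from the positional notation to the feature-structure view can be sketched as follows (the attribute tables are a small illustrative subset of the lexical specifications, not the full inventory):

```python
# Attribute tables for category N only; a small illustrative subset of the
# full MULTEXT-EAST specifications.
NOUN_ATTRS = ["Type", "Gender", "Number", "Case"]
NOUN_VALUES = {
    "Type":   {"c": "common", "p": "proper"},
    "Gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "Number": {"s": "singular", "p": "plural"},
    "Case":   {},  # case codes are language-specific; '-' means "not applicable"
}
CATEGORIES = {"N": ("Noun", NOUN_ATTRS, NOUN_VALUES)}

def decode_msd(msd):
    """Expand a positional MSD string into a feature structure (a dict)."""
    cat, attrs, values = CATEGORIES[msd[0]]
    fs = {"Cat": cat}
    for pos, attr in enumerate(attrs, start=1):
        code = msd[pos] if pos < len(msd) else "-"
        if code != "-":                  # '-' marks a non-applying attribute
            fs[attr] = values[attr].get(code, code)
    return fs
```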
1. Number of lemmas and word forms
                  Lemmas   Entries
Nouns               9891     47969
  (masculine        4180     25100)
  (feminine         4120     16493)
  (neuter           1591      6376)
Verbs               4140    226666
Adjectives          2155     19397
Pronouns              92       110
Adverbs              790       790
Adpositions           98        98
Conjunctions          76        76
Numerals              67        67
Interjections        172       172
Particles             86        86
Total              17567    295431
2. Means of creation

At the beginning of our work we had two machine-readable dictionaries of the Bulgarian language: (1) the "Bulgarian explanatory dictionary", fourth edition, 1994; (2) the "Orthographic (spelling) dictionary of the Bulgarian language", first edition, 1983. From the first dictionary we extracted the main words with appropriate grammatical information: noun, masculine; verb, transitive, perfect; etc. The entries in the second dictionary contain only the base form of each lemma with an attached set of word forms that characterises the paradigm of the given lemma. This information is enough for a human user to generate the whole paradigm of each lemma.
We constructed a set of rules to make this implicit information explicit. The rules were encoded in a program and run on the merged vocabulary of the two dictionaries. The result is a list of lemmas with enough information for the automatic generation of all word forms. From this list, the lexicon for the Bulgarian part of the MULTEXT-EAST project was extracted.
The Bulgarian MTE lexicon largely covers the available texts (Orwell and so on). We constructed a program that attaches the MTE lexical descriptions to each word form generated from the list of lemmas.
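The paradigm-expansion idea can be illustrated with a toy sketch (the inflection pattern name and suffixes below are invented for the example and do not reproduce the actual Bulgarian rules):

```python
# The paradigm label and suffixes are invented for the example; the real
# rules derive the paradigm from the word forms listed in the dictionary entry.
PARADIGMS = {
    "noun-m1": ["", "a", "ove"],   # hypothetical suffix set for one noun class
}

def expand(lemma, paradigm):
    """Generate all word forms of a lemma from its paradigm label."""
    return [lemma + suffix for suffix in PARADIGMS[paradigm]]
```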
1. Number of lemmas and word forms
                  Lemmas   Wordforms
Nouns               6889       45602
Verbs               3604       16685
Adjectives          4038       91378
Pronouns              33         541
Determiners            0           0   (irrelevant for Czech)
Articles               0           0   (irrelevant for Czech)
Adverbs              748         900
Adpositions           41          47
Conjunctions          38          46
Numerals              19         127
Interjections          0           0
Residual               0           0
Abbreviations        260         290
Particles             33          33
Total              15703      155649
2. Means of creation

Authors of the word form lexicon:
- Jan Hajic (Faculty of Mathematics and Physics, Charles University)
- Vladimir Petkevic (Faculty of Philosophy, Charles University)
The word form lexicon was elaborated as follows. All the word forms from all three MTE corpora were submitted to the Czech morphological analyzer. This analyzer assigns each word form its morphosyntactic information in the form of a string of characters similar to that used in MULTEXT-EAST. Conversion tables between the Czech analyzer format and the MULTEXT-EAST format therefore had to be written, reflecting the correspondences between the two, and a conversion program was then written. In addition, there were important differences and problems to be solved. The Czech analyzer did not properly account for parts of speech such as pronouns, prepositions, and conjunctions, and in some cases the MULTEXT-EAST morphosyntactic information was more specific than that provided by the analyzer; a special table for these parts of speech and for irregulars was therefore written. Moreover, it was not always easy to mark the lemma by '=', because, e.g., for Czech nouns the lemma need not necessarily be masculine singular (normally a default). Finally, some words encountered in the three corpora were not yet included in our internal electronic lexicon for Czech, so the lexicon had to be extended with these words. This was because the corpora chosen for the Czech MTE monolingual corpus were lexically very rich not only in, say, common nouns but also in proper nouns (opera characters in the fiction corpus, for instance).
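The conversion-table step might be sketched as follows (the analyzer tags, table entries and irregulars shown are hypothetical illustrations, not the actual Czech tables):

```python
# Both the analyzer tags and the table entries below are hypothetical
# illustrations; the real correspondence tables are much larger.
CONVERSION_TABLE = {
    "N.MS1": "Ncmsn",   # e.g. masculine noun, singular, nominative
    "N.FS4": "Ncfsa",
}
IRREGULARS = {"se": "Px---"}  # special table for problematic items

def convert(wordform, analyzer_tag):
    """Map one analyzer reading onto a MULTEXT-EAST-style MSD."""
    if wordform in IRREGULARS:
        return IRREGULARS[wordform]
    return CONVERSION_TABLE[analyzer_tag]
```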
After the word form lexicon was eventually developed and processed by a sequence of conversion programs, it was validated with Tomaž Erjavec's validating tools (mtems-expand, lexmsd), which are available to the MTE partners at the Slovene web site (Tomaž Erjavec's support was much appreciated by the Czech partner). Some errors and bugs were detected by the validator, but after approximately three test cycles the lexicon was fully conformant with the latest version of the lexical specifications.
The lexicon contains more than the required 15,000 lemmas (see above). However, the three Czech MTE corpora processed contain more lemmas and word forms than the delivered lexicon, so a selection of the word forms and lemmas to be included in the word form lexicon had to be made. The lexicon can be extended by automatic and semi-automatic means (new words had, and have, to be included in the internal lexicon for Czech).
Given the extreme richness of Czech morphology, which is often very intricate and irregular (the most complicated in the family of Slavic languages), the Czech partner considers the preparation of the word form lexicon the most difficult task within the MTE project so far. As a result, the task could be accomplished only hours before the prescribed deadline.
Notes:
(a) No strings belonging to the class Residual have been included in the word form lexicon, but they can be added almost without problems.
(b) The lexicon is being continually extended.
1. Number of lemmas and word forms
                    Lemmas   Wordforms
Nouns (N)            29816       73228
Verbs (V)             3133       18253
Adjectives (A)       10949       22542
Numerals (M)           137         561
Pronouns (P)            60         926
Adpositions (S)        177         177
Conjunctions (C)        23          23
Interjections (I)       94          94
Adverbs (R)           2678        2678
Together             42803      118482

The lemmas were counted using a series of scripts like the one in Appendix 1.
2. Means of creation

The lexicon was created semi-automatically. The basis for creating it was a corpus of 450,000 words of Estonian written texts from 1985, included in the Base Corpus of Estonian Literary Language created at the University of Tartu. The corpus used for creating the lexicon contains 150 kW of newspapers, 150 kW of fiction (both of which contain the 100 kW MTE corpora) and 150 kW of science.
First, a frequency list of the wordforms was made. Then non-words (numbers, one-letter words, acronyms and abbreviations) were deleted from the list, and the list was run through a morphology analyser from the company Filosoft. The output was then transformed to conform to the MTE specifications; the transformation was done semi-automatically, using some UNIX scripts specifically written for this task.
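The first two steps of this pipeline can be sketched as follows (the filtering criteria are a simplified reading of the text above, not the original scripts):

```python
from collections import Counter

def frequency_list(tokens):
    """Count wordforms, then drop the 'non-words' mentioned above: numbers,
    one-letter words, and period-final abbreviations.  The criteria here are
    a simplified reading of the text, not the original UNIX scripts."""
    counts = Counter(token.lower() for token in tokens)
    def is_word(w):
        return len(w) > 1 and not w[0].isdigit() and not w.endswith(".")
    return {w: n for w, n in counts.items() if is_word(w)}
```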
This was thought to be better than generating all the possible wordforms while omitting compounds and derivations. The way the lexicon was created in effect imitates a situation where no morphology analyser is available, but people able to link wordforms with lemmas are.
As a result, the lexicon:
- contains 43kW of lemmas and 118kW of wordforms
- should cover 95% of the wordforms met in '1984', MTE fiction and newspaper corpora
- contains compound words and derivations which normally would not be a part of the lexicon
- does not represent the complete paradigms of all the lemmas
Bugs and delicate matters:
- For some reason, there are about 30 duplicated entries in the lexicon. These should be deleted.
- The lexicon does not contain compound words with hyphens. Some of these words can be analyzed as separate simplex words, but some cannot.
- The lexicon does not contain adjectives which are homonymous with geographical names, differing from the latter only in beginning with a small letter, e.g. Aasia (English: Asia) vs. aasia (English: Asian).
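The duplicated entries mentioned in the first point could be removed with a straightforward order-preserving filter, assuming one lexicon entry per line:

```python
def deduplicate(lines):
    """Drop exact duplicate lexicon lines, keeping the first occurrence."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique
```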
1. Number of lemmas and word forms

The text contained 81,167 words, of which 25,081 were different forms.
                    Lemmas   Wordforms
Nouns (N)             2526       10591
Verbs (V)              664        7069
Adjectives (A)        3228        5508
Numerals (M)           117         294
Pronouns (P)            78         483
Adpositions (S)         55         106
Conjunctions (C)        50          50
Interjections (I)       23          23
Adverbs (R)            863         880
Articles (T)             1           4
Abbreviations (Y)        1           5
Residuals (X)           38          68
Together              7644       25081
2. Means of creation

Since sufficient morphological analysis of Hungarian cannot be achieved with list-based dictionaries, we had to follow a different path while trying to keep maximal compatibility with the MULTEXT policy.
For this we made two important steps.
In the first, we eliminated the derivative and compound forms. This process is very straightforward in Hungarian: we took the constituent morphemes of the derivative or compound form, created a new word by merging these constituents into one, and gave it the word class of its rightmost constituent. This way we reduced the number of segmentation sequences to an acceptable limit which could be described within the framework of the MULTEXT-EAST lexical specifications. We thus reached compatibility with the MULTEXT lexical specifications by reducing segmented constituents to Stem+Suffix combinations only.
But the number of different Stem+Suffix combinations is still too high to list them in a dictionary. Importantly, by reducing prefix+stem+derivation sequences to new stems we created an enormous number of new elements which now have to be listed in the dictionary. For the exact numbers see Appendix 2, The Number of Hungarian Word Forms:
... from an average verb we can create 540 other verbs and 2160 nominal forms. With all of its suffixes a verb can have more than 2 million forms ...

To reach as much compatibility as possible with the morphological process, we decided on the following: we extended the analysis of Hungarian with another step.
In this step, a Hungarian morphological analyzer (HUMOR, delivered by the subcontractor) processes the text, and a tool then converts the results into a MULTEXT dictionary. Afterwards, the Multext morphological analyzer can run using the dictionary created in the previous step, dedicated specifically to the current text. If we create the appropriate dictionary and supply it with the text (e.g. for "1984"), then the analysis can be carried out at sites where the additional tool is not available or where extending the process is unwanted.
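The conversion of analyzer output into a text-specific MULTEXT dictionary can be sketched as follows (the triple format and the tags in the example are assumptions, not the actual HUMOR output):

```python
def build_text_dictionary(analyses):
    """analyses: iterable of (wordform, lemma, msd) triples, as assumed here
    for the converted analyzer output.  Duplicates are collapsed and the
    entries are emitted in a tab-separated lexicon layout."""
    return ["\t".join(entry) for entry in sorted(set(analyses))]
```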
- Dan Tufiş
  Center for Research in Machine Learning, Natural Language Processing and Conceptual Modelling - RACAI, Bucharest (contractor)
- Lidia Diaconu, Călin Diaconu, Ana-Maria Barbu, Camelia Popescu, Ileana Letinu
Research Institute for Informatics - ICI, Bucharest (subcontractor)
There are two dictionaries for Romanian. The first is the proper dictionary, containing words likely to be found in usual written texts. The second is the dictionary of invented words (Newspeak) appearing in Orwell's book "1984", which is the text used for building the multilingual corpus. In the statistics provided below, the Newspeak "words" are shown as additional (+ k). For instance, the line:

                    Lemmas      Wordforms
Adjectives (A)      11078+24    347128+128

is to be interpreted as follows: there are 11078 lemmas for normal Romanian adjectives, which produced 347128 inflected forms, plus 24 Newspeak Romanian adjectives, which produced a further 128 wordforms.
1. Number of lemmas and word forms
The statistical data on the Romanian lexicon (including the newspeak words) is given below:
                    Lemmas      Wordforms
Adjectives (A)      11078+24    347128+128
Conjunctions (C)       59          121
Determiners (D)        60          890
Interjections (I)     182+3        187+3
Numerals (M)           98         1622
Nouns (N)           17398+36    322284+142
Pronouns (P)           80+1        1630+2
Particles (Q)           6           14
Adverbs (R)           688+2        1247+6
Adpositions (S)        57+2          89+4
Articles (T)           12           79
Verbs (V)            9084+3     617921+12
Residuals (X)          12           14
Total               38814+71   1293226+297

2. Means of creation
The Multext-East lexicon has been created by means of a unification-based linguistic processing environment (mac-ELU), developed by us in cooperation with ISSCO-Geneva. We developed a large unification-based lexicon (about 20,000 entries).
In Appendix 4, we provide a (partial) mac-ELU description of the Romanian dictionary encoding.
1. Number of lemmas and word forms
Category             Entrs    WFSs     Lms      =   MSDs
Nouns (N):           24205   10384    4407   2747     88
Verbs (V):           15337    9285    3053    832    114
Adjectives (A):      40682    7930    3176   1039    245
Pronouns (P):         1991     438      78     59    976
Adverbs (R):          1818    1807    1342    360      3
Adpositions (S):       120     106      44     77      6
Conjunctions (C):       37      37       1     37      2
Numerals (M):         1359     195       6      1    139
Interjections (I):       6       6       1      6      1
Residuals (X):           0       0       0      0      0
Abbreviations (Y):      17      17       1     17      1
Particles (Q):          61      61       1     61      1
TOTAL (*):           85633   26197   10730   5236   1576

Entrs: Number of entries
WFSs: Number of distinct word-forms
Lms: Number of distinct lemmas
= : Number of '=' lemmas
MSDs: Number of distinct morphosyntactic descriptions
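For illustration, figures like those in the table could be computed from a lexicon given as (wordform, lemma, MSD) triples, with '=' marking a lemma identical to the wordform (a sketch under that assumed entry format):

```python
def category_stats(entries, pos):
    """entries: (wordform, lemma, msd) triples; pos: category letter.
    Computes the five figures used in the table above; the treatment of
    '=' lemmas follows our reading of the legend."""
    rows = [(w, l, m) for (w, l, m) in entries if m.startswith(pos)]
    return {
        "Entrs": len(rows),
        "WFSs":  len({w for w, _, _ in rows}),
        "Lms":   len({w if l == "=" else l for w, l, _ in rows}),
        "=":     len({w for w, l, _ in rows if l == "="}),
        "MSDs":  len({m for _, _, m in rows}),
    }
```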
2. Means of creation
The lexicon is produced for the Ljubljana partner by the subcontractor Amebis d.o.o. from their proprietary lexical database.
The lexicon was created automatically by expanding the Amebis database entries and mapping their descriptions into the MULTEXT-EAST V2 morphosyntactic descriptions.
The Slovene word-form lexicon currently covers the non-idiosyncratic word-forms appearing in the Slovene "1984" and Fiction corpora. Thus Newspeak words and uncommon proper names are not included in the lexicon; furthermore, only word-forms actually appearing in the corpora are included, and not the complete paradigms of the lexicon lemmas.
Copyright © Centre National de la Recherche Scientifique, 1996.