Representation of Linguistic Corpora

Principal investigator

Nancy Ide

Department of Computer Science
Vassar College
Poughkeepsie, New York 12601 USA
tel: (+1) 914 437 5988
fax: (+1) 914 437 7498
e-mail: ide@cs.vassar.edu

Laboratoire Parole et Langage
CNRS & Université de Provence
29, Avenue Robert Schuman, 13621 Aix-en-Provence Cedex 1, France
tel: (+33) 42 95 36 34
fax: (+33) 42 59 50 96
e-mail: ide@univ-aix.fr

Project summary

This project aims to provide a theoretical background and develop coherent methodologies for the representation, access, and manipulation of corpora used in corpus-based natural language processing (NLP) research. The project builds on and continues a program of collaborative research, established in 1988, between Vassar College's Department of Computer Science and the Laboratoire Parole et Langage (LP&L) of the Centre National de la Recherche Scientifique (CNRS) in Aix-en-Provence, France. The work is carried out in the context of the European projects MULTEXT, MULTEXT-EAST, and EAGLES (in particular, the EAGLES Text Representation subgroup), supported under the European Commission LRE program and coordinated by LP&L. The work undertaken at Vassar College is supported by a grant from the National Science Foundation (NSF RUI grant IRI-9413451).

The increasing interest in large-scale textual resources for NLP research has led to a rapid proliferation of both textual data and text-handling tools. Much of the currently available data is marked up and annotated in ad hoc formats, most of which are inconsistent with one another, and almost none of which was developed on the basis of a sound model of text and text categories or with serious consideration of the needs of corpus-based NLP research. Similarly, and for related reasons, there is enormous redundancy in the functionality of existing corpus-handling software (part-of-speech taggers, statistics-gathering programs, etc.), because the same systems must be re-invented over and over again to accommodate specific input and output formats and platforms. Because such software is typically built as large, monolithic systems, the ability to modify it and re-use relevant pieces in other applications is severely limited. Again, the lack of a principled basis for text software design is the cause of this redundancy and limited reusability.

Our goal is to develop a sound basis and methodology for corpus representation as well as for the design of corpus-handling tools. There is an obvious dependency between the two, which demands that they be developed hand in hand. The task involves: (1) analysis of the needs of corpus-based NLP research, both in terms of the kinds and degree of annotation required and the requirements for efficient processing, accessibility, etc.; (2) analysis of general properties and configuration of corpora, analysis of relevant structural and logical features of component text types, and the design of encoding mechanisms that can represent all required elements and features while accommodating the requirements determined in (1); and (3) specifications for text software design, coordinated with (2), with the aim of avoiding redundancy and maximizing modifiability, extensibility, and reusability.
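As a rough illustration of the software design aim in (3), the sketch below shows small tools that share one token representation, so that each piece can be replaced or re-used independently. It is a minimal sketch under stated assumptions, not the project's design: the `Token` class, the function names, and the toy lexicon are all hypothetical and do not come from MULTEXT or the CES.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical common representation shared by all tools."""
    text: str
    tag: Optional[str] = None


def tokenize(text: str) -> Iterator[Token]:
    """Split raw text into word and punctuation tokens."""
    for match in re.finditer(r"\w+|[^\w\s]", text):
        yield Token(match.group())


def tag_with_lexicon(tokens: Iterable[Token], lexicon: Dict[str, str]) -> Iterator[Token]:
    """Toy tagger: look each token up in a lexicon, defaulting to 'UNK'."""
    for tok in tokens:
        yield Token(tok.text, lexicon.get(tok.text.lower(), "UNK"))


def count_tags(tokens: Iterable[Token]) -> Counter:
    """Statistics-gathering step that only depends on the Token interface."""
    return Counter(tok.tag for tok in tokens)


# Illustrative lexicon (an assumption, not real linguistic data).
lexicon = {"the": "DET", "cat": "NOUN", "sleeps": "VERB"}
tokens = list(tag_with_lexicon(tokenize("The cat sleeps."), lexicon))
tag_counts = count_tags(tokens)
```

Because the tagger and the counter only see `Token` objects, either one could be swapped for a different implementation without touching the other, which is the kind of reusability the text argues for.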

Current activity

Currently, the focus of work within the project is the development of a Corpus Encoding Standard (CES) optimally suited for use in language engineering, intended to serve as a widely accepted set of encoding standards for corpus-based work in natural language processing applications. The CES is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language) compliant with the specifications of the TEI Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative. The CES specifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and typographic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding specifications for linguistic annotation, together with a data architecture for linguistic corpora.
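To make the idea of descriptive structural markup concrete, the sketch below parses a small marked-up text and retrieves sentences by identifier, as a text database might. Several things here are assumptions: XML is used as a stand-in for SGML so that Python's standard library can parse it, the element and attribute names (`text`, `p`, `s`, `id`) are illustrative rather than taken from the CES, and keeping the annotation in a separate structure that points at sentence ids is only one possible arrangement, not one described in this text.

```python
import xml.etree.ElementTree as ET

# Illustrative markup only: XML stand-in for SGML, hypothetical element names.
SAMPLE = """<text>
  <p id="p1">
    <s id="s1">Corpora need a common encoding.</s>
    <s id="s2">Tools can then be shared.</s>
  </p>
</text>"""

root = ET.fromstring(SAMPLE)

# Index sentences by identifier so they can be retrieved directly.
sentences = {s.get("id"): s.text for s in root.iter("s")}

# A simple annotation record that refers to a sentence by its id
# (an assumed arrangement for illustration).
annotations = [
    {
        "target": "s1",
        "type": "token_count",
        "value": len(sentences["s1"].split()),
    }
]
```

The structural markup carries the segmentation, so a later tool can address `s1` or `s2` without re-deriving sentence boundaries from raw text.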

The CES is being developed in a bottom-up fashion, starting with minimal specifications and expanding based on feedback from its use and on input from the research community in general. Comments and discussion on any aspect of the CES are invited and encouraged. The most recent draft of the CES is available at <URL:http://www.cs.vassar.edu/CES/> and <URL:http://www.lpl.univ-aix.fr/projects/eagles/TR/>. The document is also available by ftp as a tar file.

References

Bourbeau, L., Pinard, F. (1995). Normalisation et internationalisation: Inventaire et prospective des normes clefs pour le traitement informatique du français. Progiciels BPI. Montréal. <URL:ftp://fcar.qc.ca/pub/ceveil/norme_t1.zip>

Bryan, M. (1988). SGML: An Author's Guide, Addison-Wesley Publishing Company, New York.

Burnard, L. (1995). Text Encoding for Information Interchange: An Introduction to the Text Encoding Initiative, TEI Document no TEI J31, Oxford University Computing Services. <URL:http://info.ox.ac.uk/~archive/teij31>

Coombs, J.H., Renear, A.H., and DeRose, S.J. (1987). Markup systems and the future of scholarly text processing. Communications of the ACM, 30, 11, 933-947.

Cover, R. (1994). SGML Web Page. <URL:http://www.sil.org/sgml/sgml.html>

DeRose, S.J., Durand, D.G. (1994). Making HyperMedia Work: A User's Guide to HyTime. Kluwer Academic Publishers, Boston.

Goldfarb, C.F. (1990). The SGML Handbook, Clarendon Press, Oxford.

Ide, N. et al. (1995). Corpus Encoding Standard. Version of December 1995. <URL:http://www.cs.vassar.edu/CES/>

Ide, N. (1994). Encoding standards for large text resources. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan, 574-78.

Ide, N., Véronis, J. (1994). MULTEXT: Multilingual Text Tools and Corpora. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan, 588-92.

Ide, N., Véronis, J. (1993). What next after the Text Encoding Initiative? The need for text software. ACH Newsletter, Winter, 1-3.

Ide, N., Véronis, J. (Eds.) (1995a). The Text Encoding Initiative: Background and Context. Kluwer Academic Publishers, Dordrecht, 342p. [reprinted from triple special issue of Computers and the Humanities, 29, no 1/2/3, with an original bibliography]

ISO 8879 (1986). Information Processing--Text and Office Systems--Standard Generalized Markup Language (SGML), ISO, Geneva.

ISO/IEC DIS 10744 (1992). Hypermedia/Time-based Document Structuring Language (HyTime), ISO, Geneva.

Kimber, W. Eliot (1995). Practical Hypermedia: An Introduction to HyTime. Charles F. Goldfarb Series On Open Information Management. Prentice-Hall Professional Technical Reference, New York; approximately 250 pages. ISBN: 0-13-309899-0.

Newcomb, S.R., Kipp, N.A., Newcomb, V.T. (1991). The 'HyTime' Hypermedia/Time-based Document Structuring Language. Communications of the ACM, 34, 11, 67-83.

Sperberg-McQueen, C.M., Burnard, L. (Eds.) (1994). Guidelines for Electronic Text Encoding and Interchange, Text Encoding Initiative, Chicago and Oxford. <URL:http://etext.virginia.edu/TEI.html>

van Herwijnen, E. (1991). Practical SGML, Kluwer Academic Publishers, Boston [2nd edition, 1994].
