The XCES consists of eight schemas. The xcesGlobal and xcesLink schemas do not declare any elements and are imported/included by the other schemas. The eight schemas are:
xcesDoc.xsd | : Encoding conventions for level 1 XCES documents. |
xcesAna.xsd | : Encoding conventions for annotated data. |
xcesAlign.xsd | : Encoding conventions for aligned data. |
xcesWord.xsd | : Extends xcesDoc to provide word level tags for stand-off annotation. |
xcesSpoken.xsd | : Extends xcesDoc to provide tags for encoding spoken data. |
xcesHeader.xsd | : The XCES header used by all XCES documents. |
xcesGlobal.xsd | : Global group and type definitions. |
xcesLink.xsd | : XLink attribute definitions used in xcesAna.xsd and xcesAlign.xsd. : Used to import the xlink namespace. |
Download all files: xces-schema-0_2.zip
The XCES schemas were created automatically from the XCES DTDs using XML Spy and then extensively hand modified.
xcesDoc resources :
<?xml version="1.0"?> <cesCorpus xmlns="http://www.xml-ces.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.xml-ces.org/schema http://www.cs.vassar.edu/XCES/schema/xcesDoc.xsd" version="1.0"> ... </cesCorpus>
xcesAna resources :
<cesAna xmlns="http://www.xml-ces.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.xml-ces.org/schema http://www.cs.vassar.edu/XCES/schema/xcesAna.xsd" version="1.0"> ... </cesAna>
xcesAlign resources :
<?xml version="1.0"?> <cesAlign xmlns="http://www.xml-ces.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.xml-ces.org/schema http://www.cs.vassar.edu/XCES/schema/xcesAlign.xsd" version="1.0"> ... </cesAlign>
xcesWord resources :
<?xml version="1.0"?> <cesWord xmlns="http://www.xml-ces.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.xml-ces.org/schema http://www.cs.vassar.edu/XCES/schema/xcesWord.xsd" version="1.0"> ... </cesWord>
xcesSpoken resources :
<?xml version="1.0"?> <cesSpoken xmlns="http://www.xml-ces.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.xml-ces.org/schema http://www.cs.vassar.edu/XCES/schema/xcesSpoken.xsd" version="1.0"> ... </cesSpoken>
xcesHeader resources :
<?xml version="1.0"?> <cesHeader xmlns="http://www.xml-ces.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.xml-ces.org/schema http://www.cs.vassar.edu/XCES/schema/xcesHeader.xsd" version="1.0"> ... </cesHeader>
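The xsi:schemaLocation attribute in the root elements above pairs the XCES namespace with the URL of the schema to validate against. As a minimal sketch, assuming only the Python standard library, the pairs can be read back from an instance document like this (the helper name schema_locations is hypothetical and not part of the XCES):

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

def schema_locations(xml_text):
    """Return {namespace: schema URL} taken from the root's xsi:schemaLocation."""
    root = ET.fromstring(xml_text)
    tokens = root.get(f"{{{XSI}}}schemaLocation", "").split()
    # The attribute value is a whitespace-separated list of (namespace, location) pairs.
    return dict(zip(tokens[0::2], tokens[1::2]))

# Root element copied from the xcesDoc example, with empty content for brevity.
doc = (
    '<cesCorpus xmlns="http://www.xml-ces.org/schema" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.xml-ces.org/schema '
    'http://www.cs.vassar.edu/XCES/schema/xcesDoc.xsd" version="1.0"/>'
)
locations = schema_locations(doc)
root_tag = ET.fromstring(doc).tag
```

Because the XCES namespace is declared as the default namespace, ElementTree reports the root tag in its qualified form, {http://www.xml-ces.org/schema}cesCorpus.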
The XCES header is described in its own schema, making it possible to create standalone header files. The header can be stored in the document in a <cesHeader> element, as in the CES, or it may be stored externally, with an <xcesHeader> element used to link to the header file. Although it is not defined here, it is possible to define a new document type that consists of a sequence of headers in one file (a headerbase) and to use XPointer expressions to locate the fragment containing the desired header.
Headers.XML : <?xml version="1.0"?>
<cesHeaders>
<cesHeader id="h1"> ... </cesHeader>
<cesHeader id="h2"> ... </cesHeader>
<cesHeader id="h3"> ... </cesHeader>
<cesHeader id="h4"> ... </cesHeader>
</cesHeaders>
Corpus.XML : <?xml version="1.0"?> <cesCorpus> <cesHeader> ... </cesHeader> <cesDoc> <xcesHeader xlink:href="Headers.XML#h1"/> <text> ... </text> </cesDoc> <cesDoc> <xcesHeader xlink:href="Headers.XML#h2"/> <text> ... </text> </cesDoc> ... </cesCorpus>
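The headerbase lookup described above can be sketched in a few lines. This is an illustration only: the resolve_header helper and the <title> placeholder content are hypothetical, namespaces are omitted as in the example files, and only bare-name fragments such as #h2 are handled, not full xpointer() expressions.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Stand-in for Headers.XML; the <title> children are placeholder content.
HEADERS_XML = """<cesHeaders>
  <cesHeader id="h1"><title>First</title></cesHeader>
  <cesHeader id="h2"><title>Second</title></cesHeader>
</cesHeaders>"""

def resolve_header(link, headerbases):
    """Follow an xcesHeader link of the form 'file#id' into a parsed headerbase."""
    href = link.get(f"{{{XLINK}}}href")
    file_name, _, fragment = href.partition("#")
    root = headerbases[file_name]
    for header in root.iter("cesHeader"):
        if header.get("id") == fragment:
            return header
    return None  # no header with that id in the headerbase

headerbases = {"Headers.XML": ET.fromstring(HEADERS_XML)}
link = ET.fromstring(
    f'<xcesHeader xmlns:xlink="{XLINK}" xlink:href="Headers.XML#h2"/>'
)
found = resolve_header(link, headerbases)
```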
The following elements have been added to the <profileDesc> element in the <cesHeader>.
<particDesc> (i.e. participation description)
Description | :Describes the identifiable speakers, voices or other participants in a linguistic interaction. | |||||||||
XPath | cesHeader/profileDesc/particDesc | |||||||||
Attributes |
|
|||||||||
Content Model | (person | personGrp)+ particLinks? | |||||||||
Example |
<particDesc> <person id="p1" sex="f" age="42">Female informant, well educated, born in Boston US, 12 Jan 1950, of unknown occupation. Speaks English fluently.</person> <person id="p2" sex="m" age="43"/> <particLinks> <relation active="p1 p2" desc="spouse"/> </particLinks> </particDesc> |
<person>
Description | :Describes an individual participant in a linguistic interaction. | |||||||||||||||
XPath | cesHeader/profileDesc/particDesc/person | |||||||||||||||
Attributes |
|
|||||||||||||||
Content Model | Character data | |||||||||||||||
Example |
<person id="p1" sex="f" age="42">Female informant, well educated, born in Boston US, 12 Jan 1950, of unknown occupation. Speaks English fluently. </person> |
<personGrp>
Description | :Describes a group of individuals treated as a single entity for analytical reasons. | ||||||||||||||||||
XPath | cesHeader/profileDesc/particDesc/personGrp | ||||||||||||||||||
Attributes |
|
||||||||||||||||||
Content Model | Character data | ||||||||||||||||||
Example |
<particDesc> <person id="p1" sex="f" age="42">Female informant, well educated, born in Boston US, 12 Jan 1950, of unknown occupation. Speaks English fluently.</person> <person id="p2" sex="m" age="43"/> <particLinks> <relation active="p1 p2" desc="spouse"/> </particLinks> </particDesc> |
<particLinks> (i.e. participation relationships)
Description | :Describes the relationships or social links existing amongst participants in an interaction. | ||||||
XPath | cesHeader/profileDesc/particDesc/particLinks | ||||||
Attributes |
|
||||||
Content Model | relation+ | ||||||
Example |
<particLinks> <relation desc="parent" active="p1 p2" passive="p3 p4" mutual="n"/> <relation desc="spouse" active="p1 p2"/> <relation type="social" desc="employer" active="p1" passive="p3 p5 p6 p7" mutual="n"/> </particLinks> |
<relation> (i.e. relationship)
Description | :Describes any kind of relationship between a specified group of participants. | |||||||||||||||||||||
XPath | cesHeader/profileDesc/particDesc/particLinks/relation | |||||||||||||||||||||
Attributes |
|
|||||||||||||||||||||
Content Model | Empty | |||||||||||||||||||||
Example |
<relation type="social" desc="supervisor" active="p1" passive="p2 p3 p4" mutual="n"/> <relation type="personal" desc="friends" active="p2 p3 p4" mutual="y"/> |
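The active and passive attributes in the <relation> examples hold space-separated lists of participant ids. A minimal sketch of reading them, assuming un-namespaced markup as in the examples (the read_relations helper is hypothetical):

```python
import xml.etree.ElementTree as ET

# Two relations copied from the <particLinks> example.
PARTIC_LINKS = """<particLinks>
  <relation desc="parent" active="p1 p2" passive="p3 p4" mutual="n"/>
  <relation desc="spouse" active="p1 p2"/>
</particLinks>"""

def read_relations(xml_text):
    """Return one dict per <relation>, with the id lists split on whitespace."""
    relations = []
    for rel in ET.fromstring(xml_text).iter("relation"):
        relations.append({
            "desc": rel.get("desc"),
            "active": rel.get("active", "").split(),
            # A missing passive attribute yields an empty list.
            "passive": rel.get("passive", "").split(),
        })
    return relations

relations = read_relations(PARTIC_LINKS)
```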
<settingDesc> (i.e. setting description)
Description | :Describes the setting or settings within which a language interaction takes place. | |||||||||
XPath | cesHeader/profileDesc/settingDesc | |||||||||
Attributes |
|
|||||||||
Content Model | setting+ | |||||||||
Example |
<settingDesc> <setting>Texts recorded in the Canadian Parliament building in Ottawa, between April and November 1988.</setting> </settingDesc> |
<setting>
Description | :Describes one particular setting in which a language interaction takes place. | |||||||||
XPath | cesHeader/profileDesc/settingDesc/setting | |||||||||
Attributes |
|
|||||||||
Content Model | (name | time | locale | activity)* | |||||||||
Mixed | true | |||||||||
Example |
<setting who="p1 p2 p3"> <name>New York City</name> <time>1989</time> <locale>on a park bench</locale> <activity>feeding birds</activity> </setting> |
<name> (i.e. name or proper noun)
Description | Contains a proper noun or noun phrase | ||||||
XPath | cesHeader/profileDesc/settingDesc/setting/name | ||||||
Attributes |
|
||||||
Content Model | Character data. | ||||||
Mixed | true | ||||||
Example |
<name>New York City</name> |
<time>
Description | A phrase containing the time of day in any form. | |||||||||||||||
XPath | cesHeader/profileDesc/settingDesc/setting/time | |||||||||||||||
Attributes |
|
|||||||||||||||
Content Model | Character data. | |||||||||||||||
Example |
<setting> |
<locale>
Description | A brief informal description of the nature of a place. | ||||||
XPath | cesHeader/profileDesc/settingDesc/setting/locale | ||||||
Attributes |
|
||||||
Content Model | Character data. | ||||||
Example |
<setting> |
Almost all elements in the XCES have an associated simple or complex type. The only exceptions are the root elements in each schema document. All elements, attributes, and types have been placed in the namespace http://www.xml-ces.org/schema. It is recommended, but not required, that http://www.xml-ces.org/schema be made the default namespace for XCES documents. For the remainder of this document it will be assumed that the prefix xces: refers to the namespace http://www.xml-ces.org/schema.
The CES DTDs use an ENTITY definition to represent the set of attributes that belong to the class a.global.
<!ENTITY % a.global ' id ID #IMPLIED n CDATA #IMPLIED lang IDREF #IMPLIED xml:lang CDATA #IMPLIED'>
In the XCES the a.global entity has been replaced with the attribute group xces:a.global defined in xcesGlobal.xsd.
<xs:attributeGroup name="a.global">
  <xs:attribute name="id" type="xs:ID"/>
  <xs:attribute name="n" type="xs:string"/>
  <xs:attribute name="lang" type="xs:IDREF"/>
  <xs:attribute ref="xml:lang"/>
</xs:attributeGroup>
Each of the top-level schemas (xcesAlign, xcesAna, and xcesDoc) extends the set of global attributes by adding attributes specific to that type of document. The attributes added are:
Schema | Attribute Group Name | Attributes Added | Attribute Meaning |
xcesAlign | xces:a.align | wsd | Character encoding used. |
xcesAna | xces:a.ana | type | Provides more precise information about the element's function or role. |
xcesAna | xces:a.ana | wsd | Character encoding used. |
xcesDoc | xces:a.text | rend | Rendering information about the original version. |
xcesDoc | xces:a.text | wsd | Character encoding used. |
The cesDoc DTD makes use of entities to represent element classes similar to the TEI element classes. In the xcesDoc.xsd schema these are represented by element groups. For example, the entities:
<!ENTITY % m.token 'abbr | date | num |measure | name | term | time |'> <!ENTITY % m.phrase '%m.token; foreign | mentioned | distinct | title | hi | list | corr | gap | reg | ptr | ref'> <!ENTITY % phrase.seq '#PCDATA | %m.phrase;'>
are replaced by the element groups xces:m.token, xces:m.common, and xces:phrase.seq.
<xs:group name="m.token">
  <xs:choice>
    <xs:element name="abbr" type="xces:abbrType"/>
    <xs:element name="date" type="xces:dateType"/>
    <xs:element name="num" type="xces:numType"/>
    <xs:element name="measure" type="xces:measureType"/>
    <xs:element name="name" type="xces:nameType"/>
    <xs:element name="term" type="xces:termType"/>
    <xs:element name="time" type="xces:timeType"/>
  </xs:choice>
</xs:group>
<xs:group name="m.common">
  <xs:choice>
    <xs:element name="list" type="xces:listType"/>
    <xs:element name="corr" type="xces:corrType"/>
    <xs:element name="gap" type="xces:gapType"/>
    <xs:element name="reg" type="xces:regType"/>
    <xs:element name="ptr" type="xces:ptrType"/>
    <xs:element name="ref" type="xces:refType"/>
  </xs:choice>
</xs:group>
<xs:group name="phrase.seq">
  <xs:choice>
    <xs:group ref="xces:m.token"/>
    <xs:group ref="xces:m.common"/>
    <xs:element name="foreign" type="xces:foreignType"/>
    <xs:element name="mentioned" type="xces:mentionedType"/>
    <xs:element name="distinct" type="xces:distinctType"/>
    <xs:element name="title" type="xces:titleType"/>
    <xs:element name="hi" type="xces:hiType"/>
  </xs:choice>
</xs:group>
In addition to the above groups, element groups have also been defined that correspond to the following CES entities:
The xcesGlobal.xsd schema defines the string type xces:class.string, which extends xs:string by adding the global attributes xces:a.global. xces:class.string is then used as the base type when defining other string types. All types that extend xces:class.string have a String suffix, e.g. xces:annotationString and xces:creationString.
XLink attributes and XPointer expressions are used in the XCES to represent links between documents. XPointers can be used to express points and ranges in an XML document whether or not elements in the document contain IDs. However, not all CES linking elements have been converted to XLink links. For example, the <cesAlign> element contains fromDoc, toDoc, fromLocation, and toLocation attributes that can be used to specify the target documents that are being aligned. To model these attributes with XLink would require four new elements to be added. Therefore these attributes remain in the XCES as links; however, XPointers should be used to specify the targets. For example:
<cesAlign fromDoc="corpus/english/text1.xml" toDoc="corpus/spanish/text1.xml" fromLocation="#xpointer(id('p1')/range-to(id('p5')))" toLocation="#xpointer(id('p1')/range-to(id('p6')))" ...>
Or:
<cesAlign fromDoc="corpus/english/text1.xml#xpointer(id('p1')/range-to(id('p5')))" toDoc="corpus/spanish/text1.xml#xpointer(id('p1')/range-to(id('p6')))" ...>
At this time XPointer has not been made a final recommendation by the W3C, and as a result the XCES may need to be changed in the future.
There are four linking elements in an XCES annotation document used to indicate a range being annotated: <chunk>, <tok>, <s>, and <par>. In the CES these elements contain the attributes from and to that are used as links. The <chunk> element also contains a doc attribute used as a link. In the XCES these elements are now simple links (xlink:type="simple") and use the xlink:href attribute with an XPointer expression to express the range. For example:
TEXT.CES : <chunk doc="/corpus/en/text1.ces" from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\25"> <tok from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\5"/> ... </chunk>
Becomes:
TEXT.XCES: <chunk xml:base="/corpus/en/text1.ces" xlink:href="#xpointer(string-range(/2/1/1/1/2/1, '', 1, 25))"> <tok xlink:href="#xpointer(string-range(/2/1/1/1/2/1, '', 1, 5))"/> ... </chunk>
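The conversion shown above, from CES from/to values to an XPointer string-range expression, can be sketched as follows. This is a hypothetical helper, not part of the XCES. It assumes that the part before the backslash is a dot-separated child sequence, that the part after it is a 1-based character offset, and that the fourth string-range argument is a length computed as end - start + 1, which agrees with the example because the range starts at offset 1.

```python
def ces_to_xpointer(ces_from, ces_to):
    """Convert CES 'path\\offset' endpoints into an XPointer string-range href."""
    from_path, _, from_offset = ces_from.partition("\\")
    to_path, _, to_offset = ces_to.partition("\\")
    if from_path != to_path:
        # Ranges that cross element boundaries would need range-to() instead.
        raise ValueError("this sketch only handles ranges within one element")
    start = int(from_offset)
    length = int(to_offset) - start + 1  # assumption: inclusive end offset
    child_sequence = "/" + from_path.replace(".", "/")
    return f"#xpointer(string-range({child_sequence}, '', {start}, {length}))"

# The <chunk> endpoints from the CES example.
href = ces_to_xpointer("2.1.1.1.2.1\\1", "2.1.1.1.2.1\\25")
```

Endpoints without a character offset raise a ValueError here; a fuller converter would have to decide how to treat whole-element targets.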
The <cesAna> and <chunkList> elements also contain the xml:base attribute so the common portions of the file's location can be specified at a higher level. For example:
<cesAna xml:base="http://www.xml-ces.org/" ...> ... <chunkList xml:base="corpus/en/"> <chunk xml:base="text1.xces" xlink:href="#xpointer(id('p1s1')/range-to(id('p1s8')))"> <tok xlink:href="#xpointer(id('p1s1w1')/range-to(id('p1s1w2')))"/> ... </chunk> <chunk xml:base="text2.xces" xlink:href="#xpointer(id('p1s9')/range-to(id('p2s10')))"> <tok xlink:href="#p1s9w1"/> ... </chunk> ... </chunkList> ... </cesAna>
In the CES the <link> element uses the targets or xtargets attribute to specify a semicolon-delimited list of fragments being aligned. In the XCES the <link> element has been changed to an XLink extended link (xlink:type="extended") that contains a sequence of <align> elements (xlink:type="locator") used to identify the fragments being aligned.
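Nested xml:base values combine like ordinary relative URI references, each one resolved against the base inherited from its ancestors. A minimal sketch using urljoin from the Python standard library (the resolve helper is hypothetical; a real processor would walk the element tree rather than take a list):

```python
from urllib.parse import urljoin

def resolve(bases, href):
    """Resolve href against xml:base values listed from outermost to innermost."""
    resolved = ""
    for base in bases:
        # Each xml:base is itself resolved against the base in scope above it.
        resolved = urljoin(resolved, base)
    return urljoin(resolved, href)

# xml:base values of <cesAna>, <chunkList> and the first <chunk> in the example.
target = resolve(
    ["http://www.xml-ces.org/", "corpus/en/", "text1.xces"],
    "#xpointer(id('p1s1')/range-to(id('p1s8')))",
)
```

For the first <chunk> this yields http://www.xml-ces.org/corpus/en/text1.xces followed by the XPointer fragment.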
DOC1: <s id="p1s1">According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products.</s> <s id="p1s2">Cola drink manufacturers in particular achieved above- average growth rates.</s> <!-- ... --> DOC2: <s id="p1s1">Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d'adeptes.</s> <s id="p1s2">En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment.</s> CES ALIGN DOC: <linkGrp targType="s"> <link xtargets="p1s1 ; p1s1"> <link xtargets="p1s2 ; p1s2"> </linkGrp>
Becomes:
XCES ALIGN DOC: <linkGrp targType="s"> <link> <align xlink:href="#p1s1"/> <align xlink:href="#p1s1"/> </link> <link> <align xlink:href="#p1s2"/> <align xlink:href="#p1s2"/> </link> </linkGrp>
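The rewrite of a CES xtargets value into a sequence of <align> locators can be sketched as below. The link_from_xtargets helper is hypothetical and covers only the one-to-one case shown above, where each semicolon-separated target is a single id; a target that lists several ids would need an XPointer range instead.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)

def link_from_xtargets(xtargets):
    """Build an XCES-style <link> from a CES xtargets string such as 'p1s1 ; p1s1'."""
    link = ET.Element("link")
    for target in xtargets.split(";"):
        align = ET.SubElement(link, "align")
        # One locator per aligned fragment, kept in document order.
        align.set(f"{{{XLINK}}}href", "#" + target.strip())
    return link

link = link_from_xtargets("p1s1 ; p1s1")
hrefs = [align.get(f"{{{XLINK}}}href") for align in link]
```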
The order of the <align> elements within a <link> element is significant. Unless otherwise specified the order is assumed to match the ordering of <translation> elements in the header. If a different ordering is required the attribute n in the <translation> element and the attribute n in the <align> element can be used to explicitly link an <align> element with a specific translation. Many-to-one alignments and many-to-many alignments can be represented by providing a range for the XPointer expression. N-to-zero alignments can be indicated by omitting one or more of the <align> elements and using the n attribute to specify which translations any remaining <align> elements refer to. Alternatively, the href attribute can be set to #xces:undefined to indicate that there is no translation for that fragment in that language.
header.xml:
<cesHeader version="2.3">
  ...
  <translations>
    <translation trans.loc="text-fr.xml" xml:lang="fr" wsd="ISO8859-1" n="1"/>
    <translation trans.loc="text-en.xml" xml:lang="en" wsd="ISO8859-1" n="2"/>
    <translation trans.loc="text-ro.xml" xml:lang="ro" wsd="ISO8859-1" n="3"/>
    <translation trans.loc="text-cz.xml" xml:lang="cz" wsd="ISO8859-1" n="4"/>
  </translations>
</cesHeader>
align.xml:
<cesAlign type="sent" version="1.6">
  <cesHeader xlink:href="header.xml"/>
  <linkList>
    <!-- sentence alignments -->
    <linkGrp domains="d1 d1 d1 d1" targType="s">
      <link>
        <!-- Same ordering as translation elements [fr,en,ro,cz] -->
        <align xlink:href="#s1"/>
        <align xlink:href="#s1"/>
        <align xlink:href="#s1"/>
        <align xlink:href="#s1"/>
      </link>
      <link>
        <!-- Reverse order [cz,ro,en,fr] -->
        <align n="4" xlink:href="#s2"/>
        <align n="3" xlink:href="#s2"/>
        <align n="2" xlink:href="#s2"/>
        <align n="1" xlink:href="#s2"/>
      </link>
      <link>
        <!-- No English translation [3ro,2cz,1fr]-->
        <align n="3" xlink:href="#xpointer(id('s3')/range-to(id('s5')))"/>
        <align n="4" xlink:href="#xpointer(id('s3')/range-to(id('s4')))"/>
        <align n="1" xlink:href="#s3"/>
      </link>
      <link>
        <!-- 3rd align is fr, the rest are taken in order of translation [1en,1ro,2fr,0cz] -->
        <align xlink:href="#s3"/>
        <align xlink:href="#s4"/>
        <align n="1" xlink:href="#xpointer(id('s4')/range-to(id('s5')))"/>
        <align xlink:href="#xces:undefined"/>
      </link>
      ...
    </linkGrp>
  </linkList>
</cesAlign>
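One possible reading of the ordering rules above is sketched here. It is an interpretation inferred from the comments in the example, not a normative algorithm: an <align> with an explicit n claims that translation, and the remaining <align> elements are matched, in order, to the translations that have not been claimed. The function name and the tuple representation are hypothetical.

```python
UNDEFINED = "#xces:undefined"

def assign_translations(aligns, translation_ns):
    """Map translation n values to hrefs.

    aligns: list of (n or None, href) in document order.
    translation_ns: n values of the <translation> elements in header order.
    """
    explicit = {n for n, _ in aligns if n is not None}
    # Translations not claimed by an explicit n, in header order.
    remaining = iter(n for n in translation_ns if n not in explicit)
    assigned = {}
    for n, href in aligns:
        key = n if n is not None else next(remaining)
        assigned[key] = href
    # Drop placeholders that mark a missing translation.
    return {n: href for n, href in assigned.items() if href != UNDEFINED}

# Fourth <link> of the example: en and ro by position, fr by n="1", cz undefined.
mixed = assign_translations(
    [(None, "#s3"), (None, "#s4"),
     (1, "#xpointer(id('s4')/range-to(id('s5')))"), (None, UNDEFINED)],
    [1, 2, 3, 4],
)
```

With the header above (fr=1, en=2, ro=3, cz=4) the result maps 2 to #s3, 3 to #s4 and 1 to the range, and leaves 4 out. The sketch raises StopIteration if a link holds more unnumbered <align> elements than there are unclaimed translations.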
Frequently the data that is to be aligned or annotated is not marked up in a suitable format: for example, when sentence alignment is provided for target documents that are marked only to the paragraph level, or when annotation is stored separately to allow for multiple parallel annotations of the same phenomenon. The following provides a simple example of stand-off annotation.
These files are meant as examples only. The French translation was performed at http://babelfish.altavista.com. Some words have been purposely tokenized incorrectly (e.g. C'est is marked as one word so that it aligns with two English words).
Download zip archive with all example files
We would like to thank Altova GmbH and Altova Inc. for providing their XML Spy Suite software to be used in the development of the XCES, in the context of the American National Corpus project.