Corpus Encoding Standard - Document CES 1. Annex 9. Version 0.9.2 Last Modified 2 February 1999.

Annex 9

Minimization

Excerpt from "Background and context for the development of a Corpus Encoding Standard"


There are several possible means to reduce the number of characters added to a text when markup is introduced:

tag minimization: SGML provides several mechanisms for reducing the amount of markup in a text, such as omission of start and end tags, short start and end tags, and minimization of attribute values. For example, the following definitions allow end-tag omission:
The following is a full markup for the sentence fragment "The boat sinks...":
With end tag omission this could be replaced by
which in this case is a nearly 50% reduction in the number of characters.
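The size of such a saving can be checked with a short script. The sketch below is an illustration only: the markup string, the part-of-speech code DT, and the helper names omit_end_tags and reduction are assumptions, not the example from the original annex. It strips end tags textually and says nothing about whether a given DTD actually permits the omission.

```python
import re

# Matches an SGML end tag such as </orth> or </w>.
END_TAG = re.compile(r"</[A-Za-z][A-Za-z0-9.-]*\s*>")


def omit_end_tags(markup: str) -> str:
    """Return the markup with every end tag removed (purely textual)."""
    return END_TAG.sub("", markup)


def reduction(full: str, minimized: str) -> float:
    """Fraction of characters saved by the minimized form."""
    return 1 - len(minimized) / len(full)


# Hypothetical full markup for a single word; not taken from the annex.
full = "<w><orth>The</orth><pos>DT</pos><lem>the</lem></w>"
minimized = omit_end_tags(full)

print(minimized)
print(f"{reduction(full, minimized):.0%} fewer characters")
```

On this hypothetical fragment, 23 of the 50 characters are end tags, so the saving is of the same order as the figure quoted above.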
SGML entities: SGML allows string substitution via entity replacement. An entity reference can stand in for any string, including one that contains markup. So, for example, a complex feature structure specification that occurs frequently in the text can be replaced by an entity reference consisting of only a few characters. The TEI feature structure
could be replaced by the entity reference &VBZ;. Analogous substitutions for other word categories could yield the following encoding:
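The substitution mechanism itself can be sketched in a few lines. The code below is a rough model of entity expansion, not an SGML parser: the replacement text bound to VBZ is a placeholder feature structure invented for illustration (the one from the annex is not reproduced here), and the names ENTITIES and expand_entities are assumptions.

```python
import re

# Matches a general entity reference such as &VBZ;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

# Hypothetical entity table; the replacement text is a placeholder.
ENTITIES = {
    "VBZ": '<fs><f name="cat"><sym value="verb"></f>'
           '<f name="tense"><sym value="pres"></f></fs>',
}


def expand_entities(text: str, table: dict) -> str:
    """Replace each &name; with its replacement text from the table.

    References to names missing from the table are left unchanged.
    """
    def replace(match):
        return table.get(match.group(1), match.group(0))

    return ENTITY_REF.sub(replace, text)


short = "<w>sinks&VBZ;</w>"
long_form = expand_entities(short, ENTITIES)
print(len(short), "characters stored,", len(long_form), "after expansion")
```

The stored text carries only the short reference; the full replacement text appears once, in the entity declaration.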
DATATAG feature: When certain tag sequences occur regularly, a character can be defined that is interpreted as the end tag of an element. For example, the following declarations specify that the character "|" can be interpreted as the end tag for <orth> and <pos>:
<orth>, <pos>, and <lem> are also defined so as to allow omission of both the start and end tags. This yields the following possible encoding:
If we also specify that the carriage return implies the end-tag of element <w>, the encoding could be reduced even further to
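As a rough illustration of what a DATATAG-aware parser would infer from such minimized input, the sketch below expands one word per line back into full markup. It is a simulation in Python, not SGML processing; the sample words, the part-of-speech codes, and the function names expand_word and expand_datatag are assumptions.

```python
def expand_word(line: str) -> str:
    """Expand 'orth|pos|lem' into fully tagged markup for one <w>.

    "|" plays the role of the end tag of <orth> and <pos>; the end of
    the line closes <lem> and <w>.
    """
    orth, pos, lem = line.split("|")  # raises ValueError unless 3 fields
    return f"<w><orth>{orth}</orth><pos>{pos}</pos><lem>{lem}</lem></w>"


def expand_datatag(text: str) -> str:
    """Expand every non-empty line of the minimized encoding."""
    return "\n".join(
        expand_word(line) for line in text.splitlines() if line.strip()
    )


# Hypothetical minimized encoding, one word per line.
minimized = "The|DT|the\nboat|NN|boat\nsinks|VBZ|sink\n"
print(expand_datatag(minimized))
```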
non-SGML notations: It is also possible to use private, less verbose non-SGML schemes within element content or as attribute values. For example, the encoder could decide to use a private notation within the <s> element in the example above. If that notation uses the pipe sign as a separator between word, part of speech, and lemma, the encoding would be exactly as given above. However, the DTD would simply specify
<!ELEMENT s - - (#PCDATA) >
which means that the SGML parser will not process the content of the <s> element in any way. The content would have to be processed by other software. This is in contrast to the use of DATATAG above, where the SGML parser (assuming the optional feature DATATAG is implemented) will understand and process the content of the <s> element as consisting of three elements.
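The "other software" needed for a private notation can be quite small. The following sketch assumes the content of <s> has already been handed over by the parser as a plain string, with one word per line and the pipe sign separating word, part of speech, and lemma. The type Word and the function parse_s_content are hypothetical names, and the sample data is invented.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """One token of the private notation."""
    orth: str
    pos: str
    lem: str


def parse_s_content(pcdata: str) -> List[Word]:
    """Split the #PCDATA content of <s> into Word records."""
    words = []
    for number, line in enumerate(pcdata.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines around the content
        fields = line.split("|")
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
        words.append(Word(*fields))
    return words


for word in parse_s_content("The|DT|the\nboat|NN|boat\nsinks|VBZ|sink\n"):
    print(word.orth, word.pos, word.lem)
```

Any validation, such as the field-count check here, becomes the responsibility of this code, since the SGML parser sees only character data.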