Beyond the need to query and retrieve based on tags which exist in a TEI document, a means to manipulate and query classes of objects is also desirable. The TEI DTD uses SGML entity definitions to create "classes" of elements and attributes, in particular, for groups of elements with common structural properties (e.g., all elements that can appear between paragraphs), groups of attributes which apply to certain classes of elements (e.g., attributes for pointer elements), etc. In addition to grouping together elements and attributes with common structural properties, the definition of such classes recognizes common semantic properties among elements and attributes. However, the SGML entity definition mechanism provides only for string substitution within the DTD itself, thereby enabling easy reference to these classes in later element definitions; the common semantic properties that are implicit in the classification scheme are lost for the purposes of retrieval and document manipulation. Obviously, a means to refer to and manipulate classes of elements and attributes in a query and retrieval system would provide substantial additional power for the user.
We are experimenting with the representation of a DTD and associated documents (i.e., documents conformant to the DTD) in a knowledge representation (KR) system, in order to provide more sophisticated query and retrieval from TEI documents than current systems provide. We are using CLASSIC, a frame-based representation system developed at AT&T Bell Laboratories [2]. Like many KR systems, CLASSIC enables the definition of structured concepts/frames, their organization into taxonomies, the creation and manipulation of individual instances of such concepts, and inference such as inheritance, relation transitivity, inverses, etc. In addition, CLASSIC provides for the key inferences of subsumption and classification [4]. By representing a document as an individual instance of a hierarchy of concepts derived from the DTD, and by allowing the creation of additional user-defined concepts and relations, sophisticated query and retrieval operations can be performed. This paper briefly describes the CLASSIC system, the representation of a DTD and a document conforming to that DTD in CLASSIC, and provides an overview of the kind of query and retrieval that can be performed.
Historically, CLASSIC is a descendant of KL-ONE, and is in the family of languages known as description logics. Such languages attempt to deal with terminological reasoning and automatic classification [5].
There are three kinds of formal objects in CLASSIC:
CLASSIC provides for the following kinds of inference:
The roles in the derived ontology are not meant to capture the complete syntactic meaning of the DTD. The contains role, for example, simply indicates that one element can be structurally contained in another. The first and next roles are used to represent the order. For example, the element definition:
    <!ELEMENT keywords - - (term+ | list) >

would correspond to the following concept definitions:
    keywords :: (and dtd-element
                     (all contains (or term list))
                     (all first (or term list)))
    term :: (and dtd-element (all next term))

This is an oversimplification, which is actually incorrect in a number of details, but the example illustrates that the syntax specified in the DTD is loosely captured in the ontology as containment and sequencing restrictions. In this example, individuals of KEYWORDS are restricted to having CONTAINS roles filled with individuals of TERM or LIST, and having FIRST roles with the same restriction. Note that this is not necessarily a connection between the concept KEYWORDS and the concept TERM (or LIST), but is rather a restriction that is passed on to individuals of KEYWORDS. Attributes of elements are captured as roles with simple string values (with one exception, noted in the next section). The restrictions on the attribute values specified in the DTD are not represented. For example, the following attribute definition in a DTD:
    <!ATTLIST div %a.text; complete (y | n) >

specifies that the values of the COMPLETE attribute must be either Y or N. The ontology generated for this will simply say:
    div :: (all complete string)

(in addition to any other structural information specified for this element), which says that the complete role must have a string for a value, and does not restrict the string to "Y" or "N".
Note that our system does not do error checking on documents; there are many tools to test DTD conformance, so it is not necessary to capture these restrictions.
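As a rough illustration of the loose mapping described above, the following Python sketch turns a simple element declaration into a containment restriction. It is not the code used in our system: the function name `element_to_concept` is hypothetical, only declarations of the form shown in the example are handled, and (like the example itself) it ignores occurrence indicators and the first/next ordering roles.

```python
import re

def element_to_concept(decl):
    """Map a simple <!ELEMENT ...> declaration to a loose CLASSIC-style
    containment restriction (illustrative only; ordering is ignored)."""
    m = re.match(r"<!ELEMENT\s+(\S+)\s+[-oO]\s+[-oO]\s+\((.*)\)\s*>", decl.strip())
    if m is None:
        raise ValueError("unsupported element declaration: " + decl)
    name, model = m.groups()

    # Keep only the element names from the content model, dropping
    # connectors (| , &) and occurrence indicators (+ * ?).
    children = []
    for token in re.findall(r"[A-Za-z][\w.-]*", model):
        if token not in children:
            children.append(token)
    if not children:
        raise ValueError("empty content model: " + decl)

    filler = children[0] if len(children) == 1 else "(or " + " ".join(children) + ")"
    return f"{name} :: (and dtd-element (all contains {filler}))"

print(element_to_concept("<!ELEMENT keywords - - (term+ | list) >"))
# keywords :: (and dtd-element (all contains (or term list)))
```

Because conformance is checked elsewhere, a sketch of this kind can afford to discard everything in the content model except the names of the elements that may be contained.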
Once defined as members of a class, groups of elements can be queried, manipulated, etc. as a group, rather than as a list of individuals. In addition, properties can be associated with any class which are then automatically "inherited" by all members of that class; for example, if the class person-name is defined as having sub-parts FIRSTNAME and LASTNAME, then all members of the class person-name are defined to have those sub-parts. Queries can also exploit inherited properties.
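The inheritance behaviour just described can be sketched in a few lines of Python. This is only an analogy for what CLASSIC does internally, not CLASSIC syntax; the `Concept` class, its methods, and the helper `members` are hypothetical names introduced for the illustration, and the placement of NAME and AUTHOR under person-name is assumed for the example.

```python
class Concept:
    """A named concept with parent concepts and directly declared sub-parts."""

    def __init__(self, name, parents=(), parts=()):
        self.name = name
        self.parents = list(parents)
        self.parts = list(parts)

    def ancestors(self):
        """All more general concepts, nearest first, without duplicates."""
        seen = []
        for parent in self.parents:
            for concept in [parent] + parent.ancestors():
                if concept not in seen:
                    seen.append(concept)
        return seen

    def all_parts(self):
        """Own sub-parts plus those inherited from every ancestor."""
        result = list(self.parts)
        for concept in self.ancestors():
            for part in concept.parts:
                if part not in result:
                    result.append(part)
        return result

    def is_a(self, other):
        return other is self or other in self.ancestors()


def members(cls, concepts):
    """Names of the concepts that fall under cls (excluding cls itself)."""
    return [c.name for c in concepts if c is not cls and c.is_a(cls)]


# Assumed example taxonomy: NAME and AUTHOR are both kinds of person-name.
person_name = Concept("person-name", parts=["FIRSTNAME", "LASTNAME"])
name = Concept("NAME", parents=[person_name])
author = Concept("AUTHOR", parents=[person_name])
title = Concept("TITLE")

print(name.all_parts())                            # ['FIRSTNAME', 'LASTNAME']
print(members(person_name, [name, author, title])) # ['NAME', 'AUTHOR']
```

A query phrased against person-name thus reaches NAME and AUTHOR without the user having to enumerate them.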
The type attribute is treated specially when generating an ontology from a DTD. In the CES DTD, the type attribute is used consistently to denote subtypes. For example, the element name has a type attribute which has values such as PERSON, ORGANIZATION, etc. Some elements in the DTD specify a closed list of values for the TYPE attribute (the measure element in the CES DTD defines the possible values of type to be one of weight, length, count, area, volume, temperature, currency), but most do not. If a range of values is specified, each of these becomes a subconcept of the concept representing the element. If not, the user may create subconcepts manually, or may wait for these subconcepts to be created automatically when documents are processed (see below).
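The treatment of a closed list of type values might be sketched as follows. The attribute declarations used here are written for the illustration and are not quoted from the CES DTD; the function name `type_subconcepts` and the `element-value` naming scheme for subconcepts are likewise assumptions.

```python
import re

def type_subconcepts(attlist):
    """Return {subconcept name: parent concept} for a closed TYPE value list.

    An empty dict means the TYPE attribute is open-ended, so subconcepts
    would be created manually or when documents are processed.
    """
    m = re.match(r"<!ATTLIST\s+(\S+)", attlist.strip())
    if m is None:
        raise ValueError("not an attribute list declaration: " + attlist)
    element = m.group(1)

    closed = re.search(r"\btype\s+\(([^)]*)\)", attlist, flags=re.IGNORECASE)
    if closed is None:
        return {}
    values = [v.strip() for v in closed.group(1).split("|") if v.strip()]
    return {f"{element}-{value}": element for value in values}


# Hypothetical declarations, written for this example only.
measure = ("<!ATTLIST measure type (weight | length | count | area | "
           "volume | temperature | currency) #IMPLIED >")
print(sorted(type_subconcepts(measure)))
print(type_subconcepts("<!ATTLIST name type CDATA #IMPLIED >"))  # {}
```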
Roles may also be associated with any CLASSIC concept, enabling the creation of more complex relations. For example, a concept person can be established with a role name; this role can be filled by any individual object (element) which is a person-name, such as NAME, AUTHOR, etc. Another role, affiliation, could also be associated with person and filled by an individual of an appropriate concept/element such as organization.
It is important to note that we are here defining semantic relationships, as opposed to the syntactic relationships defined by the DTD. For example, in a marked-up document in which a P element occurs within a BODY element, it is tempting to refer to the BODY element as a parent of the P element, since this is true for the parse tree representing the document. However, here we refer to this kind of structural relationship using the contains role, and we would therefore say the BODY element contains the P element.
In our terminology, one element is the parent of another only when it is more general--analogous to a superclass in an object-oriented language. However, there is no way to express this kind of parent relationship in SGML. The CES DTD and the TEI DTD both attempt to use entities to create "classes" of elements and attributes, but because entities are only a string-replacement mechanism, there is no means by which class membership can be recognized or manipulated by typical SGML-aware software. The ability to manipulate and query on the basis of class membership is one of the most important and promising aspects of the CLASSIC representation for SGML documents.
For example, the document

    <HTML>
    <TITLE>Example</TITLE>
    <BODY><P>This is a simple example.</P></BODY>
    </HTML>

would correspond to the following individuals in CLASSIC:
    HTML-1 :: (and HTML (fills contains TITLE-2 BODY-3) (fills first TITLE-2))
    TITLE-2 :: (and TITLE (fills contains CDATA-4) (fills first CDATA-4) (fills next BODY-3))
    BODY-3 :: (and BODY (fills contains P-5) (fills first P-5))
    P-5 :: (and P (fills contains CDATA-6) (fills first CDATA-6))
    CDATA-4 :: (and CDATA (fills text "Example"))
    CDATA-6 :: (and CDATA (fills text "This is a simple example."))

The name of each individual is generated automatically, and consists of the name of the most specific parent concept and a number to make the name a unique symbol. These names have no meaning to CLASSIC; they are simply place holders.
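The generation of individuals from a parsed document can be approximated with the Python sketch below. It is an illustration under several assumptions rather than our actual implementation: the input is treated as well-formed XML so that the standard `xml.etree.ElementTree` parser can be used, whitespace-only text is ignored, names are numbered breadth-first (which happens to reproduce the numbering in the example), and the function name `to_individuals` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import deque

def to_individuals(markup):
    """Return {individual name: CLASSIC-style description} for a small document."""
    root = ET.fromstring(markup)
    counter = 0

    def make(concept, payload):
        # payload is an Element for tags, or a string for character data.
        nonlocal counter
        counter += 1
        return {"name": f"{concept}-{counter}", "concept": concept,
                "payload": payload, "next": None}

    def child_items(elem):
        # Text chunks and child elements of elem, in document order.
        items = []
        if elem.text and elem.text.strip():
            items.append(("CDATA", elem.text.strip()))
        for child in elem:
            items.append((child.tag, child))
            if child.tail and child.tail.strip():
                items.append(("CDATA", child.tail.strip()))
        return items

    result = {}
    queue = deque([make(root.tag, root)])
    while queue:
        node = queue.popleft()
        parts = [node["concept"]]
        if isinstance(node["payload"], str):
            parts.append(f'(fills text "{node["payload"]}")')
        else:
            kids = [make(c, p) for c, p in child_items(node["payload"])]
            for left, right in zip(kids, kids[1:]):
                left["next"] = right["name"]   # sibling order becomes the next role
            if kids:
                names = " ".join(k["name"] for k in kids)
                parts.append(f"(fills contains {names})")
                parts.append(f"(fills first {kids[0]['name']})")
            queue.extend(kids)
        if node["next"]:
            parts.append(f"(fills next {node['next']})")
        result[node["name"]] = "(and " + " ".join(parts) + ")"
    return result


doc = "<HTML> <TITLE>Example</TITLE> <BODY><P>This is a simple example.</P></BODY> </HTML>"
for name, description in to_individuals(doc).items():
    print(name, "::", description)
```

Only the contains, first, next, and text roles are filled here; attribute roles and the creation of subconcepts from type values are left out to keep the sketch short.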
The main difference between a KR system like CLASSIC and a relational database is the ability to specify inferences that eliminate the need for a user to be completely familiar with the DTD. That is, knowledge of what the structural relationships in the document actually mean can be specified in the ontology and used to make the job of searching for information easier for the user.
For example, a marked-up document conforming to the CES DTD might contain a NAME element, with type=PERSON, within the BODY part of a document. When this element is encountered, our system will read the type attribute and, if this is the first time that type has been encountered, it will create a new concept with NAME as its parent, and then create an individual of that new concept that represents the tagged text.
Now suppose we decide that "any person-name occurring within the BODY of a document is a character." We create a new role called character, which is attached only to documents, and we can specify that the character role of a document is filled automatically with values according to the rule "any person-name occurring within the BODY of a document is a character" (the syntax for specifying such a rule is beyond the scope of this article). Subsequently, retrieval can be performed on the basis of this defined relationship; for example, all characters can be retrieved, or all characters with a certain set of characteristics, etc.--even if there is no tag for character. Note also that such queries would differentiate names applying to persons vs. names applying to places, organizations, etc., as well as names appearing in the body and names appearing elsewhere (for example, in the header). This overcomes a problem that arises due to the lack of scoping rules in SGML: for example, the element NAME may have a very different semantics depending on whether it appears in the header or in the body of the text, but SGML provides no way to differentiate instances of a given tag on the basis of context. [6]
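Although the CLASSIC rule syntax is not given here, the effect of the character rule can be imitated in Python for illustration. Everything in the sketch is hypothetical: the `Individual` class, the individual and concept names (including HEADER, DOC, and the NAME-PERSON and NAME-PLACE subconcepts), and the assumption that a NAME with type=PERSON has already been classified under person-name.

```python
from dataclasses import dataclass, field

@dataclass
class Individual:
    name: str
    concepts: set                       # every concept the individual falls under
    contains: list = field(default_factory=list)

def descendants(ind):
    """Yield every individual structurally contained in ind, at any depth."""
    for child in ind.contains:
        yield child
        yield from descendants(child)

def characters(document):
    """Fill the 'character' role: person-names occurring within a BODY."""
    found = []
    for part in descendants(document):
        if "BODY" in part.concepts:
            found.extend(d for d in descendants(part)
                         if "person-name" in d.concepts)
    return found


# A tiny hypothetical document: one name in the header, two in the body.
header_name = Individual("NAME-4", {"NAME", "NAME-PERSON", "person-name"})
body_person = Individual("NAME-7", {"NAME", "NAME-PERSON", "person-name"})
body_place = Individual("NAME-8", {"NAME", "NAME-PLACE"})
header = Individual("HEADER-2", {"HEADER"}, [header_name])
paragraph = Individual("P-6", {"P"}, [body_person, body_place])
body = Individual("BODY-3", {"BODY"}, [paragraph])
document = Individual("DOC-1", {"DOC"}, [header, body])

print([c.name for c in characters(document)])   # ['NAME-7']
```

The header name and the place name are both excluded, which is the contextual distinction that the markup alone does not provide.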
The representation of DTDs and associated documents in CLASSIC provides considerable facility for document query. Details are excluded due to space restrictions, but consider in particular the following capabilities:
The representation of SGML documents in CLASSIC may also have repercussions for DTD design. So far, DTD design has been largely unprincipled, and wide variations in the kinds of information represented by elements, attributes, and tag content often occur, even within the same DTD. However, the formal representation of elements, attributes, and content as CLASSIC objects demands consistency in their use within the DTD. The development of a set of principles for DTD design is a desideratum among the encoding community; we are looking into the ways in which formalization of DTDs in CLASSIC can contribute to this development.
At the same time, the use of a system such as CLASSIC allows for greater flexibility in tagging text. For example, for names of people, the encoder can use the general NAME tag--or even more generally, RS (referring string)--or provide a very precise encoding using PERSNAME with FIRSTNAME, LASTNAME, etc. elements inside. Once represented in CLASSIC, these objects can be both recognized as members of the class person-name and accessed and manipulated as such. This frees the encoder to choose an encoding for each name that is appropriate; there is no need for absolute consistency to enable the semantic identity of the two elements to be recognized. More generally, it allows precise tag semantics to be instantiated in a system external to the encoded text.
[2] Brachman, R., Borgida, A., McGuinness, D., and Resnick, L. The CLASSIC Knowledge Representation System. In Proceedings of the 11th IJCAI. August, 1989.
[3] Ide, N., Priest-Dorman, G., Véronis, J. (1996). Corpus Encoding Standard. Documentation and DTDs available at http://www.cs.vassar.edu/CES/.
[4] Brachman, R. What is-a is and isn't. IEEE Computer. October, 1983. pp 30-36.
[5] Minsky, M. A Framework for Representing Knowledge. Mind Design. MIT Press. Pp. 95-128. 1981.
[6] The CES attempts to solve this problem by renaming elements such as NAME, TITLE, AUTHOR, etc. when they appear in the header (e.g., H.NAME, H.TITLE, etc.). However, this solution is clearly ad hoc and only complicates the information the user is required to remember if retrieval is based solely on the specifications in the DTD.
[6] Welty, Chris. Intelligent Assistance for Navigating the Web. Proceedings of The 1996 Florida AI Research Symposium. May, 1996.