Canonical XTM

A canonical serialization format for topic maps

By: Lars Marius Garshol
Affiliation: Ontopia A/S
Date: $Date: 2001/06/29 08:21:16 $
Version: 0.1

Table of contents

1. Preliminaries
2. Serialization
2.1. <topic>
2.2. <instanceOf>
2.3. <subjectIdentity>
2.4. <resourceRef>
2.5. <subjectIndicatorRef>
2.6. <baseName>
2.7. <scope>
2.8. <baseNameString>
2.9. <variant>
2.10. <variantName>
2.11. <occurrence>
2.12. <resourceData>
2.13. <association>
3. Ordering principles
3.1. Topics
3.2. <instanceOf>, <topicRef>, <member>
3.3. Topic names, base names
3.4. Occurrences, subject indicators
3.5. Variant names
3.6. Associations
3.7. Association members
4. Canonical XTM DTD

1. Preliminaries

This specification describes a serialization format for topic maps which has the property that all logically equivalent topic maps have the exact same byte-by-byte representation in this format. This can be used to test the conformance of XTM processors.

This document is not an official document in any sense; it is just a proposal for the consideration of the topic map community. The contributions of Geir Ove Grønmo are gratefully acknowledged.

The specification describes the serialization of a topic map into an output document, but does not concern itself with where that topic map came from. It is not a goal to ensure that the canonical topic map can be successfully read into an XTM processor, but merely to confirm that all processing defined by the XTM 1.0 specification has been performed correctly.

Before serialization, the topic map must be processed into a consistent topic map, as defined by XTM 1.0. When canonicalizing XTM documents, no string normalization (such as Unicode canonical decomposition) may be performed. (This should be based on a topic map data model, which would define this for us.)

The output document must be a canonical XML document. In addition, a line feed (U+000A) must be inserted after every end tag and likewise after every start tag of elements that have element content or are empty. (This means <baseNameString>, <resourceData>, <topicRef>, <instanceOf>, <resourceRef>, <subjectIndicatorRef>.)

[Remark: Must handle: sorting of topics that have no characteristics and class-instance topic relationships with scope.]

2. Serialization

The document element must be a <topicMap> element with the xmlns attribute value set to http://www.topicmaps.org/cxtm/1.0/.

The topic map is serialized by first writing out all topics, and then writing out all associations. Since only one topic map is output, there is no mergemap information to serialize.

2.1. <topic>

Topics are sorted by their sort keys (see the Ordering principles section) and then serialized in that order. All <topic> elements must have an id attribute, set to the value 'idN', where N is the number of the topic in sort order, starting with 1.
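The ID assignment can be sketched in Python as follows. This is only an illustration: the function name assign_topic_ids and the representation of topics as a dictionary from an internal handle to an already computed sort key are assumptions, not part of this proposal.

```python
def assign_topic_ids(sort_keys):
    """Map each topic handle to its canonical ID ('id1', 'id2', ...).

    sort_keys is a dict from a (hypothetical) topic handle to the sort
    key computed according to the Ordering principles section.
    """
    # Python 3 compares str values code point by code point, which is
    # the ordering the Ordering principles section asks for.
    ordered = sorted(sort_keys, key=lambda topic: sort_keys[topic])
    return {topic: "id%d" % n for n, topic in enumerate(ordered, start=1)}


ids = assign_topic_ids({
    "t-a": "http://example.org/b",
    "t-b": "http://example.org/a",
    "t-c": "$http://example.org/c",
})
print(ids)
```

Here the topic whose key starts with '$' comes first, because '$' (U+0024) sorts before 'h' (U+0068).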

Topics are serialized by first writing out all class-instance relationships as <instanceOf> elements, then the <subjectIdentity> element, then all <baseName>s, then all <occurrence>s. The <instanceOf>, <baseName> and <occurrence> elements are ordered according to the rules in the 'Ordering principles' section.

2.2. <instanceOf>

A class-instance relationship is serialized as an <instanceOf> element, with the 'href' attribute set to the ID of the <topic> element representing the class topic, with the character '#' prepended.

Note that the <instanceOf> element is an empty element and so, according to the Canonical XML specification, must be serialized with both a start tag and an end tag, with nothing between them.
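As a small illustration, the element could be produced like this in Python. The helper name serialize_instance_of is hypothetical, and the line feeds required by the Preliminaries section are deliberately left out of the sketch.

```python
def serialize_instance_of(class_topic_id):
    """Return the <instanceOf> element for a class topic with the given ID."""
    # Canonical XML does not allow the <instanceOf/> short form, so the
    # start tag and the end tag are both written out explicitly.
    # Assigned IDs have the form 'idN', so no attribute escaping is needed.
    return '<instanceOf href="#%s"></instanceOf>' % class_topic_id


print(serialize_instance_of("id3"))
```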

2.3. <subjectIdentity>

If the topic has no addressable subject, nor any known subject indicators, this element is not output at all.

If the topic has an addressable subject, that is output first using a <resourceRef> element.

For each subject indicator the topic has, a <subjectIndicatorRef> element is output. The elements must be ordered according to the ordering principles.
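The three rules above can be sketched as a single Python function. The names serialize_subject_identity and escape_attr are made up for this example, the element is returned as a list of tags without the line feeds required elsewhere, and the attribute escaping follows the rules of Canonical XML.

```python
def escape_attr(value):
    """Escape an attribute value the way Canonical XML requires."""
    replacements = (("&", "&amp;"), ("<", "&lt;"), ('"', "&quot;"),
                    ("\t", "&#x9;"), ("\n", "&#xA;"), ("\r", "&#xD;"))
    for char, replacement in replacements:  # '&' must be handled first
        value = value.replace(char, replacement)
    return value


def serialize_subject_identity(subject_address, subject_indicators):
    """Return the tags of <subjectIdentity>, or [] if it is omitted."""
    if subject_address is None and not subject_indicators:
        return []  # neither an addressable subject nor subject indicators
    tags = ["<subjectIdentity>"]
    if subject_address is not None:
        # The addressable subject always comes first.
        tags.append('<resourceRef href="%s"></resourceRef>'
                    % escape_attr(subject_address))
    # Subject indicators are ordered by their URI (section 3.4).
    for uri in sorted(subject_indicators):
        tags.append('<subjectIndicatorRef href="%s"></subjectIndicatorRef>'
                    % escape_attr(uri))
    tags.append("</subjectIdentity>")
    return tags


for tag in serialize_subject_identity(
        "http://example.org/doc",
        ["http://example.org/psi#b", "http://example.org/psi#a"]):
    print(tag)
```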

2.4. <resourceRef>

The <resourceRef> element is an empty element, holding the reference to the resource in its 'href' attribute.

2.5. <subjectIndicatorRef>

The <subjectIndicatorRef> element is an empty element, holding the reference to the subject indicator in its 'href' attribute.

2.6. <baseName>

Each topic name is serialized using a <baseName> element. First the scope is written out using the <scope> element, then the base name value in the <baseNameString> element and finally the variant names using <variant> elements. The variant names must be ordered according to the ordering principles.

2.7. <scope>

If the scoped topic map construct has an empty scope, this element is not output at all. If it has a non-empty scope, references to the topics making up that scope are written out using <topicRef> elements in the order defined by the ordering principles.

Note that in all cases the scope that is output must be the scope that results from inheriting the scope of any parent elements that have scope. The scope of a variant name therefore consists of the union of its own scope and the scopes of all its ancestors.
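A minimal Python sketch of scope inheritance and the resulting <scope> element follows. The function names are hypothetical, scopes are modelled as sets of assigned topic IDs, and the '#idN' form of the <topicRef> href is an assumption carried over from <instanceOf>. Line feeds are left out.

```python
def effective_scope(own_scope, ancestor_scopes):
    """Union of a construct's own scope and the scopes of its ancestors."""
    scope = set(own_scope)
    for ancestor_scope in ancestor_scopes:
        scope.update(ancestor_scope)
    return scope


def serialize_scope(scope_ids):
    """Return the tags of the <scope> element, or [] for an empty scope."""
    if not scope_ids:
        return []  # an empty scope is not output at all
    tags = ["<scope>"]
    # The sort key of a <topicRef> is the ID of the topic referred to,
    # compared as a string, so 'id10' sorts before 'id2'.
    for topic_id in sorted(scope_ids):
        tags.append('<topicRef href="#%s"></topicRef>' % topic_id)
    tags.append("</scope>")
    return tags


# A variant with its own scope, a parent variant and a parent base name.
variant_scope = effective_scope({"id7"}, [{"id2"}, {"id10", "id2"}])
for tag in serialize_scope(variant_scope):
    print(tag)
```

Read literally, the ordering rules compare IDs as strings rather than as numbers, which is why 'id10' precedes 'id2' in this sketch.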

2.8. <baseNameString>

Contains the base name value.

2.9. <variant>

Each variant name is serialized using a <variant> element. First its parameters are written out using the <scope> element, then the variant name value in the <variantName> element and finally any child variant names using <variant> elements. The variant names must be ordered according to the ordering principles.

2.10. <variantName>

Contains the variant name value.

2.11. <occurrence>

Each occurrence is written out using an <occurrence> element. If the occurrence is an instance of a class, an <instanceOf> element is output first. It is followed by a <scope> element representing the scope of the occurrence (provided the scope is non-empty), and finally by a <resourceRef> element if the occurrence is an external resource, or a <resourceData> element if it is an internal resource. [Remark: This is probably too vague]

2.12. <resourceData>

Contains the resource inline.

2.13. <association>

Associations are serialized using <association> elements, which first contain an <instanceOf> element (if the association is an instance of a class), a <scope> element (unless the association is in the unconstrained scope), and finally a <member> element for each participating topic in the association. The <member> elements must be ordered according to the ordering principles.

3. Ordering principles

This section establishes how to determine the sort key value of each topic map element that is written out. This is used to ensure that all elements are serialized in a specific order. That order is obtained by sorting the elements according to their sort keys in lexicographical order, based on UCS code point values.
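To illustrate what ordering by UCS code point values means in practice, here is a short Python sketch. The function names are made up for the example. The second function is included only for contrast: it shows the order a comparison of UTF-16 code units would give, which differs for characters outside the Basic Multilingual Plane.

```python
def sort_by_code_point(keys):
    """Return the keys in lexicographical order of UCS code points."""
    # Python 3 compares str values code point by code point, so the
    # built-in ordering already implements the required comparison.
    return sorted(keys)


def sort_by_utf16_code_unit(keys):
    """For contrast only: the order a UTF-16 based comparison produces."""
    return sorted(keys, key=lambda key: key.encode("utf-16-be"))


keys = ["\U00010000", "\uff5e", "b", "B", "$a"]
print([ascii(k) for k in sort_by_code_point(keys)])
print([ascii(k) for k in sort_by_utf16_code_unit(keys)])
```

An implementation in a language whose strings are sequences of UTF-16 code units would need to take care to compare by code point instead.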

3.1. Topics

If the topic has an addressable subject, the URI of that resource is the sort key.

Failing that, if the topic has a subject indicator, the URI of the first subject indicator (as ordered according to these rules) is the sort key.

Failing that, if the topic has occurrences, the URI of the first occurrence (as ordered according to these rules), with '$' prepended, is the sort key.

Failing that, if the topic has base names, the sort key consists of the base name values of all the base names of the topic, ordered by value and separated by '|' characters.

[Remark: And what if there are no base names either?]
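The fallback chain for topics can be sketched in Python as shown here. The function name and its parameters are assumptions made for illustration. Occurrences are represented only by their URIs, since the sort key of inline occurrences is not defined, and the function returns None for the case the remark above leaves open.

```python
def topic_sort_key(subject_address, subject_indicators,
                   occurrence_uris, base_names):
    """Return the sort key of a topic, or None if the rules give none."""
    if subject_address is not None:
        return subject_address
    if subject_indicators:
        # Subject indicators are ordered by URI, so the first is the minimum.
        return min(subject_indicators)
    if occurrence_uris:
        # Occurrences are ordered by URI as well; '$' marks this case.
        return "$" + min(occurrence_uris)
    if base_names:
        return "|".join(sorted(base_names))
    return None  # no rule applies; the specification leaves this open


print(topic_sort_key(None, [], ["http://example.org/b",
                                "http://example.org/a"], ["Oslo"]))
print(topic_sort_key(None, [], [], ["Oslo", "Christiania"]))
```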

3.2. <instanceOf>, <topicRef>, <member>

The sort key is the ID of the topic element referred to.

3.3. Topic names, base names

The sort key starts with the base name value. If the base name has scope, the sort key is extended by appending a '|' character, followed by the assigned IDs of all topics in the scope of the topic name, separated by spaces and ordered according to these principles.
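As an illustration, the base name sort key could be computed like this in Python. The function name base_name_sort_key is hypothetical, and the scope is assumed to be given as a collection of assigned topic IDs.

```python
def base_name_sort_key(value, scope_ids):
    """Return the sort key of a base name with the given value and scope."""
    key = value
    if scope_ids:
        # IDs are ordered as strings, as for <topicRef> elements.
        key += "|" + " ".join(sorted(scope_ids))
    return key


print(base_name_sort_key("Oslo", []))
print(base_name_sort_key("Oslo", ["id2", "id10"]))
```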

3.4. Occurrences, subject indicators

The sort key is the URI of the resource. [Remark: What about resourceData?]

3.5. Variant names

The sort key consists of the variant name value, followed by a '|' character, followed by the assigned IDs of all topics in the scope of the variant name, separated by spaces and ordered according to these principles.

3.6. Associations

The sort key consists of the sort keys of all the association's members, in sort order and separated by '|' characters. If the association is an instance of a class, a '$' character is appended, followed by the assigned ID of the topic representing that class.
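A short Python sketch of this rule is given here. The function name is made up for the example, and the member sort keys are assumed to have been computed already according to the rules for association members.

```python
def association_sort_key(member_keys, class_id=None):
    """Return the sort key of an association.

    member_keys are the sort keys of the association's members;
    class_id is the assigned ID of the class topic, if there is one.
    """
    key = "|".join(sorted(member_keys))
    if class_id is not None:
        key += "$" + class_id
    return key


print(association_sort_key(["id4 id9", "id12 id8"], class_id="id3"))
```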

3.7. Association members

If the member has no specified role, the sort key is the ID of the topic element referred to by its <topicRef> child. If it has a role, a space and the assigned ID of the topic defining the role are appended.
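For illustration, the member sort key could be built as follows in Python. The function and parameter names are assumptions; both arguments are assigned topic IDs of the form 'idN'.

```python
def member_sort_key(player_id, role_id=None):
    """Return the sort key of an association member."""
    if role_id is None:
        return player_id  # no specified role: just the player's ID
    return player_id + " " + role_id


print(member_sort_key("id4"))
print(member_sort_key("id4", "id9"))
```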

4. Canonical XTM DTD

<!ELEMENT topicMap (topic*, association*)>
<!ATTLIST topicMap 
          xmlns       CDATA #FIXED "http://www.topicmaps.org/cxtm/1.0/">

<!ELEMENT topic (instanceOf*, subjectIdentity?, baseName*, occurrence*)>
<!ATTLIST topic
          id          ID    #REQUIRED>
          
<!ELEMENT instanceOf EMPTY>
<!ATTLIST instanceOf 
          href        CDATA #REQUIRED>

<!ELEMENT subjectIdentity (resourceRef?, subjectIndicatorRef*)>

<!ELEMENT resourceRef EMPTY>
<!ATTLIST resourceRef
          href        CDATA #REQUIRED>

<!ELEMENT subjectIndicatorRef EMPTY>
<!ATTLIST subjectIndicatorRef
          href        CDATA #REQUIRED>


<!ELEMENT baseName (scope?, baseNameString, variant*)>

<!ELEMENT scope (topicRef+)>
<!ELEMENT topicRef EMPTY>
<!ATTLIST topicRef
          href        CDATA #REQUIRED>

<!ELEMENT baseNameString (#PCDATA)>


<!ELEMENT variant (scope, variantName, variant*)>
<!ELEMENT variantName (resourceData | resourceRef)>
<!ELEMENT resourceData (#PCDATA)>


<!ELEMENT occurrence (instanceOf?, scope?, (resourceRef | resourceData))>


<!ELEMENT association (instanceOf?, scope?, member+)>
<!ELEMENT member (instanceOf?, topicRef)>