An Approach to DTDs and Namespaces

Introduction

JUMBO now implements a simple but powerful approach to DTDs and namespaces, intended to follow both the spirit and letter of XML. This has been implemented in an imminent new snapshot of JUMBO (i.e. not vapourware) and feedback is welcomed, including areas where my knowledge of SGML is shallow.

In textual applications (rendering of 'human-readable' documents on paper or screen as in current (1997) browsers) stylesheets are the preferred approach to 'display'. In this spirit JUMBO is tracking the XSL spec and provides partial support at present. However, many applications are 'non-textual' either because of the nature of their material (molecules, semantic maths, structured graphics, etc.) or because of their structure (general graphs, tables, etc.). In these cases a per-element approach is often valuable and JUMBO currently provides support by linking element display to Java classes. This leads to a simple model for namespaces which may also be useful for some textual applications.

The merits and limitations of the XML DTD

Traditional management of XML documents is through the DTD, which can provide the following:

Of these only (a) and (b) impinge on Namespaces and DTDs. JUMBO never sees (c), (d) and (g), may never see (e), and its author has publicly demonstrated that he does not understand (f) fully.

Content models

(a) allows a powerful definition of the potential structures for an element's content. It is useful for validating static XML documents, for creating new documents, and for editing or merging existing ones. JUMBO thinks it is a Good Thing and will use it whenever possible (it awaits a publicly available Java algorithm for content validation).

It has limitations in the following areas:

The first is serious since almost all non-textual applications of XML (databases, technical subjects, commerce and many more) use datatyping such as INTEGER, DATE, FLOAT, etc. An example of the second is where occurrence counts of children are substantial (e.g. 'FAMILYs with more than five and fewer than eight CHILDren' have ugly content models). [By contrast XLL with NodeSets could provide an elegant runtime validation: (ALL,FAMILY)CHILD(5,CHILD).NOT.(ALL,FAMILY)CHILD(9,CHILD)
but this is not followed further here.]
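
To make the occurrence-count point concrete, here is a minimal sketch of the same check done at runtime rather than in a content model. It is not XLL and not JUMBO code: the class and method names are hypothetical, and it assumes the standard Java DOM parser (javax.xml.parsers) is available.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class ChildCountCheck {
    /** Parses a well-formed XML string and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Counts the direct element children of the root that are named childName. */
    static int countChildren(String xml, String childName) {
        int count = 0;
        for (Node n = parseRoot(xml).getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals(childName)) {
                count++;
            }
        }
        return count;
    }

    /** True if the root has more than 'moreThan' and fewer than 'fewerThan' such children. */
    static boolean hasBetween(String xml, String childName, int moreThan, int fewerThan) {
        int count = countChildren(xml, childName);
        return count > moreThan && count < fewerThan;
    }
}
```

For comparison, a DTD content model for "six or seven CHILDren" has to spell every occurrence out, as in (CHILD,CHILD,CHILD,CHILD,CHILD,CHILD,CHILD?), and grows with the bounds, whereas the runtime check stays the same size.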

Therefore datatyping must be addressed at an early stage in authoring, editing and processing XML documents, and a DTD-compatible solution is discussed below.
 

Attribute validation

Attributes can be validated with respect to:

The Typing suffers from the same problems as content (above). I suspect very few newcomers to XML will use anything other than CDATA for attributes as they won't understand the point of the other XML types (except possibly ID). [Personally I can see no reason for having IDREF since XLL is more powerful. IDREF is a pain to implement and unless anyone convinces me otherwise I shall not put it in JUMBO. I might transform IDREF="foo" to HREF="#ID(foo)" which would do the same thing.]
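
The IDREF-to-HREF transformation mentioned above could look roughly like the following. This is only a sketch under loud assumptions: it works on the raw markup with a regular expression, and it assumes the attribute is literally named IDREF as in the example. A real implementation would work on the parsed tree and take the attribute types from the ATTLIST. The class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class IdrefToHref {
    // Assumption: the referring attribute is spelled exactly IDREF="...".
    private static final Pattern IDREF = Pattern.compile("\\bIDREF=\"([^\"]*)\"");

    /** Turns an IDREF value such as "foo" into the link form "#ID(foo)". */
    static String toHref(String idref) {
        return "#ID(" + idref.trim() + ")";
    }

    /** Rewrites every IDREF="x" attribute in the markup as HREF="#ID(x)". */
    static String rewrite(String markup) {
        Matcher m = IDREF.matcher(markup);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = "HREF=\"" + toHref(m.group(1)) + "\"";
            // quoteReplacement keeps '$' and '\' in the value from being misread
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```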

Enumerations suffer from:

It is perhaps worth noting that AFAICS no current Java XML parser provides full support for extracting DTD-based information, and I suspect that this is likely to be common among lightweight parsers. Without good support the DTD may be in danger of atrophy, perhaps limited to a few ATTLISTs in the internal subset. This paper (hopefully) adds new life.
 

The per-Element approach

In the existing DTD there are two types of information - document-wide, and per-element. The latter covers (a) and (b) above and is the subject of this document. I use 'per-element' to mean that an element can be completely described and processed without knowledge of its position in a document. IMO contextual information is best provided by XLL, with (hopefully) widely agreed semantics (not discussed here).

The structure of an element as presently defined therefore breaks down into:

<!ELEMENT element (contentspec, attlist*)>
<!ELEMENT contentspec (#PCDATA)>
<!ELEMENT attlist (#PCDATA)>

Since #PCDATA is a poor descriptor of structure, the first is better expressed [3.2.1] as:

<!ELEMENT contentspec (#PCDATA|children|Mixed)> <!-- #PCDATA is 'EMPTY' or 'ANY'-->
<!ELEMENT children (choice | seq)>
<!ATTLIST children
    repeatable (YES|NO) #REQUIRED
    optional (YUP|NOPE) #REQUIRED> <!-- I can't use YES|NO again :-( -->
<!ELEMENT choice (cp)+>
<!ATTLIST choice
    repeatable (YES|NO) #REQUIRED
    optional (YUP|NOPE) #REQUIRED>
<!ELEMENT seq (cp)+>
<!ATTLIST seq
    repeatable (YES|NO) #REQUIRED
    optional (YUP|NOPE) #REQUIRED>
<!ELEMENT cp (Name|choice|seq)>
<!ATTLIST cp
    repeatable (YES|NO) #REQUIRED
    optional (YUP|NOPE) #REQUIRED>
<!ELEMENT Name (#PCDATA)>
<!ATTLIST Name
    type (STRING|INTEGER|FLOAT|DATE|URL|HTML|OTHER) "STRING">
<!ELEMENT Mixed (#PCDATA,Name*)>

JUMBO essentially implements this and can display DTD contentspecs as trees (It was written before the 971208 spec, so details differ. BTW the spec is much clearer in this area, thanks.)
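
As an illustration of what "contentspecs as trees" can mean, the sketch below parses a contentspec string into choice/seq/Name nodes and prints them indented. It is not JUMBO's implementation; the class and method names are hypothetical, the grammar is simplified (keywords such as #PCDATA, EMPTY and ANY are simply treated as Names), and a one-item group is reported as a seq.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical recursive-descent parser, for illustration only.
class ContentSpecTree {
    /** One node of the tree: a Name leaf, or a seq/choice group with children. */
    static class Cp {
        String kind;                 // "Name", "seq" or "choice"
        String name = "";            // set only for Name leaves
        char occurrence = ' ';       // '?', '*', '+', or ' ' for exactly once
        List<Cp> children = new ArrayList<>();
    }

    private final String s;
    private int pos = 0;

    private ContentSpecTree(String spec) { this.s = spec.replaceAll("\\s+", ""); }

    static Cp parse(String spec) {
        ContentSpecTree p = new ContentSpecTree(spec);
        Cp root = p.cp();
        if (p.pos != p.s.length()) throw new IllegalArgumentException("trailing text at " + p.pos);
        return root;
    }

    private Cp cp() {
        Cp node = new Cp();
        if (pos < s.length() && s.charAt(pos) == '(') {
            pos++;                               // consume '('
            node.kind = "seq";
            node.children.add(cp());
            while (pos < s.length() && (s.charAt(pos) == ',' || s.charAt(pos) == '|')) {
                String kind = s.charAt(pos) == ',' ? "seq" : "choice";
                if (node.children.size() > 1 && !kind.equals(node.kind)) {
                    throw new IllegalArgumentException("mixed ',' and '|' at " + pos);
                }
                node.kind = kind;
                pos++;                           // consume the separator
                node.children.add(cp());
            }
            if (pos >= s.length() || s.charAt(pos) != ')') {
                throw new IllegalArgumentException("expected ')' at " + pos);
            }
            pos++;                               // consume ')'
        } else {
            int start = pos;
            while (pos < s.length() && "(),|?*+".indexOf(s.charAt(pos)) < 0) pos++;
            if (pos == start) throw new IllegalArgumentException("expected a Name at " + pos);
            node.kind = "Name";
            node.name = s.substring(start, pos);
        }
        if (pos < s.length() && "?*+".indexOf(s.charAt(pos)) >= 0) {
            node.occurrence = s.charAt(pos++);
        }
        return node;
    }

    static String render(Cp node, String indent) {
        StringBuilder out = new StringBuilder(indent);
        out.append(node.kind.equals("Name") ? node.name : node.kind);
        if (node.occurrence != ' ') out.append(node.occurrence);
        out.append('\n');
        for (Cp child : node.children) out.append(render(child, indent + "  "));
        return out.toString();
    }

    /** Convenience: parse a contentspec and return its indented tree. */
    static String tree(String spec) { return render(parse(spec), ""); }
}
```

For the contentspec (ATOMS,BONDS?), tree() returns three lines: "seq", then "  ATOMS" and "  BONDS?" indented beneath it.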

Similarly the ATTLIST structure [3.3] (again simpler) breaks down into:

<!ELEMENT attlist (AttDef)*> <!-- I would have expected (AttDef)+ -->
<!ELEMENT AttDef (Name, AttType, Default)> <!-- Name as above -->
<!ELEMENT AttType (#PCDATA | EnumeratedType)> <!-- PCDATA is 'CDATA', 'ID' etc. as from [55] and [56] -->
<!ELEMENT EnumeratedType (NotationType | Enumeration)>
<!ELEMENT NotationType (#PCDATA, Name*)> <!-- PCDATA is 'NOTATION' -->
<!ELEMENT Enumeration (#PCDATA)> <!-- PCDATA is '(A|B|C|D)' -->

This means that the ELEMENT and ATTLIST components of the DTD can be isomorphically expressed by an XML document.

This is not a world-shattering discovery, and it has been made by many people. The key point, however, is that it is far more powerful than the conventional BNF-like DTD. There are several advantages, and the only disadvantage is that it is not formally supported and encouraged by the XML spec, which requires the DTD to use a different language from XML (but even this hurdle can be overcome - see later). The advantages of the XML-DTD are:

Namespaces

In their simplest form, namespaces are simply a restatement of DTDs - each DTD describes a namespace. JUMBO honours this. In full SGML the SUBDOC facility allowed multiple DTDs (= namespaces) but XML does not have this. It is confidently assumed that XML will use namespaces instead and at least one proposal (XML-data) has already been published. That proposal had many useful features, but the underlying relational and inheritance structure would be too complex for JUMBO to implement at present. This current document proposes a much simpler approach which hopefully is very close to the core of any namespace proposal. It does not quote or rely on any non-public material or discussion.

All that is so far publicly given and used is:

The JUMBO Approach

(Almost everything described here is implemented in a working prototype, so should be seen as generally feasible. Some details (e.g. the syntax of 'xml:namespace', capitalisation of HREF) are tentative (but work). Some terminology may also be obsoleted by

All documents are in XML. This makes parsing, processing, display, editing, and everything a lot easier than having multiple syntaxes.

An XML document instance may have zero, one, or many PIs of the form:

<?xml:namespace href="some/where.xml" as="foo" ?>
<?xml:namespace href="else/where.xml" as="bar" ?>

JUMBO does NOT regard the address of the URL as important since it could be relative or absolute. Only the contents matter.
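
A rough sketch of how such PIs might be collected from a document follows. It is an assumption-laden illustration, not JUMBO code: the class name is hypothetical, the PIs are found with regular expressions rather than a parser, attribute names are matched case-insensitively because the capitalisation is tentative, and a PI without 'as' is treated as as="".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class NamespacePIs {
    private static final Pattern PI =
            Pattern.compile("<\\?xml:namespace\\s+(.*?)\\?>", Pattern.DOTALL);
    private static final Pattern ATTR =
            Pattern.compile("(\\w+)\\s*=\\s*\"([^\"]*)\"");

    /** Maps each declared prefix (the 'as' value) to its href, in document order. */
    static Map<String, String> declaredNamespaces(String document) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher pi = PI.matcher(document);
        while (pi.find()) {
            String href = null;
            String prefix = "";          // assumption: a missing 'as' means as=""
            Matcher attr = ATTR.matcher(pi.group(1));
            while (attr.find()) {
                if (attr.group(1).equalsIgnoreCase("href")) href = attr.group(2);
                if (attr.group(1).equalsIgnoreCase("as")) prefix = attr.group(2);
            }
            if (href != null) result.put(prefix, href);
        }
        return result;
    }
}
```

Applied to the two PIs shown above, the map holds "foo" for some/where.xml and "bar" for else/where.xml.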

The Namespace schema

The namespace schema identifies the elementTypes in that schema. The proposed format is something like:

<NAMESPACE FPI="-//CML//DTD Version 1.2 EN//" PREFIX="CML">
  <ELEMENT TYPE="MOL">
    <SCHEMA>jumbo/cml/MOLNode.xml</SCHEMA>
  </ELEMENT>
  <ELEMENT TYPE="ATOMS">
    <SCHEMA>jumbo/cml/ATOMSNode.xml</SCHEMA>
  </ELEMENT>
</NAMESPACE>

Minor details could be that TYPE was a child ELEMENT, etc.

(Because of the power of Xpointers to abstract subcomponents, much other information can be added to the schema without causing problems. Essentially it is transparent to the current process. Examples would be metadata, display characteristics of namespaces, etc. For example, JUMBO can make buttons different colours :-)

The schema points to per-element schemas for each element in the namespace (only two are shown above). The addresses are URLs and can be relative or absolute. MOL belongs to the CML namespace and would normally appear in a document as <CML:MOL>. However, the implementation also allows for a single namespace with no prefix ('MOL' with as=""). If two competing namespaces occur (e.g. each has namespace of "CML"), the namespace file can be easily constructed with a different PREFIX.

Per-element Schemas

Each element is described by WF XML of the form below. This could be a file-per-element (which is what JUMBO does) or could use Xpointers into a single file. Note that the ELEMENTs do NOT have a hardcoded PREFIX. The ELEMENTS could have the XML structure outlined above or they could contain PCDATA representations of conventional DTDs. JUMBO supports the latter (and will support the former) so that the result is like:

<ELEMENT TYPE="MOL">
  <NAMESPACE FPI="-//CML//DTD Version 1.2 EN//"/>
  <CONTENTSPEC>(ATOMS,BONDS?)</CONTENTSPEC>
  <ATTLIST>BUILTIN CDATA #IMPLIED</ATTLIST>
  <ATTLIST>ID ID #REQUIRED</ATTLIST>
</ELEMENT>
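
To show that the PCDATA form carries the same information as the conventional declarations, here is a minimal sketch that reads a per-element schema like the one above and rebuilds the DTD text from it. The class and its methods are hypothetical and are not JUMBO's API; the sketch assumes the standard Java DOM parser and ignores the NAMESPACE child.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader for a per-element schema, for illustration only.
class ElementSchema {
    final String type;
    final String contentSpec;
    final List<String> attDefs = new ArrayList<>();

    ElementSchema(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        type = root.getAttribute("TYPE");
        NodeList specs = root.getElementsByTagName("CONTENTSPEC");
        contentSpec = specs.getLength() > 0 ? specs.item(0).getTextContent().trim() : "";
        NodeList lists = root.getElementsByTagName("ATTLIST");
        for (int i = 0; i < lists.getLength(); i++) {
            attDefs.add(lists.item(i).getTextContent().trim());
        }
    }

    /** Rebuilds the conventional DTD declarations that the schema stands for. */
    String toDtd() {
        StringBuilder out = new StringBuilder();
        out.append("<!ELEMENT ").append(type).append(' ').append(contentSpec).append(">\n");
        for (String def : attDefs) {
            out.append("<!ATTLIST ").append(type).append(' ').append(def).append(">\n");
        }
        return out.toString();
    }
}
```

For the MOL schema above, toDtd() yields an ELEMENT declaration with the content model (ATOMS,BONDS?) followed by one ATTLIST declaration per ATTLIST child.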

A great advantage of this is that many other non-DTD element-related materials can be inserted as in:

<ELEMENT TYPE="MOL">
  <!-- ...content and attributes ... -->
  <HELPURL>MOLNodeHelp.xml</HELPURL>
  <JAVA>jumbo.cml.MOLNode.class</JAVA>
  <ICONURL MIME="image/gif">../icons/mol.gif</ICONURL>
  <STYLESHEET>http://www.chem.soc/stylesheet.xsl</STYLESHEET>
</ELEMENT>
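
One plausible way to act on a JAVA entry is to load the named class by reflection. The sketch below is an assumption about how such a binding could work, not a description of JUMBO: the class name ElementClassBinding is hypothetical, and a trailing ".class" is simply stripped before the lookup.

```java
// Hypothetical helper, for illustration only.
class ElementClassBinding {
    /**
     * Loads the class named by the text of a JAVA element, for example
     * "jumbo.cml.MOLNode.class". Returns null if it is not on the classpath,
     * so that a caller can fall back to some default handling.
     */
    static Class<?> load(String javaElementText) {
        String name = javaElementText.trim();
        String suffix = ".class";
        if (name.endsWith(suffix)) {
            name = name.substring(0, name.length() - suffix.length());
        }
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }
}
```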

One obvious attraction is that the elements can easily be used in more than one application. JUMBO is starting to do this for simple elements like jumbo.tecml.INTEGERNode. JUMBO implements a set of fundamental ELEMENTs for basic data types:

jumbo.tecml.INTEGERNode;
jumbo.tecml.FLOATNode;
jumbo.tecml.DATENode;
jumbo.tecml.STRINGNode;
jumbo.tecml.URLNode;
jumbo.tecml.HTMLNode;

These are only validated as PCDATA at parser level, but JUMBO checks formats and semantic validity (using java.* classes where possible). These classes also implement functions like display(), edit(), isValid() for use in renderers and authoring tools.
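
The kind of format check meant here can be sketched with standard library classes. This is not the jumbo.tecml code: the class DataTypeCheck is hypothetical, DATE is assumed to be in ISO yyyy-MM-dd form, URL is assumed to mean an absolute URI, and any other type is passed through as STRING.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

// Hypothetical validator, for illustration only.
class DataTypeCheck {
    /** True if text is acceptable content for the named type. */
    static boolean isValid(String type, String text) {
        String value = text.trim();
        try {
            switch (type) {
                case "INTEGER":
                    Integer.parseInt(value);
                    return true;
                case "FLOAT":
                    Double.parseDouble(value);
                    return true;
                case "DATE":
                    LocalDate.parse(value);      // assumption: ISO yyyy-MM-dd
                    return true;
                case "URL":
                    return new URI(value).isAbsolute();
                default:
                    return true;                 // STRING, HTML, OTHER: not checked here
            }
        } catch (NumberFormatException | DateTimeParseException | URISyntaxException e) {
            return false;
        }
    }
}
```

The parser still sees only PCDATA; a check of this sort runs afterwards, per element, which is what makes it reusable across applications.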

Mixing namespaces

In real applications it is highly likely that ELEMENTs can have content from another DTD. For example, a MOL could contain an RDF:* identifying the metadata for that molecule. The current validation procedure with a fixed contentspec will therefore fail. The following are possible mechanisms:

As an example of the second we might include a reserved elementType (e.g. XDEV:OTHER) which signified that an element from another namespace was allowable here. (I'd actually prefer the spec to address this.) The validation procedure could be modified to allow this to match any element NOT in the namespace related to the content. The third is the most powerful and could include constructions of the sort:
<ELEMENT>
  <NAME>&namespace;:&name;</NAME>
  <CONTENT>(&foo;|&bar;)</CONTENT>
</ELEMENT>
which effectively act as PEs and seem to have at least as much power, if not more. By reasonable use of entities, any desired DTD can be created just before parsing.
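
The reserved-elementType idea (XDEV:OTHER) amounts to a small change in how one name in a content model is matched against an actual element. The sketch below shows that single matching step only. It is hypothetical, not JUMBO code, and it simplifies by comparing prefixes rather than the namespaces the prefixes are bound to.

```java
// Hypothetical matching rule, for illustration only.
class OtherNamespaceMatch {
    static final String WILDCARD = "XDEV:OTHER";

    /** Returns the prefix of a name such as "CML:MOL", or "" if there is no colon. */
    static String prefixOf(String name) {
        int colon = name.indexOf(':');
        return colon < 0 ? "" : name.substring(0, colon);
    }

    /**
     * True if the element 'actual' is acceptable where the content model names
     * 'expected'. The wildcard accepts any element whose prefix differs from
     * ownPrefix, the prefix of the namespace that owns the content model.
     */
    static boolean matches(String expected, String actual, String ownPrefix) {
        if (expected.equals(WILDCARD)) {
            return !prefixOf(actual).equals(ownPrefix);
        }
        return expected.equals(actual);
    }
}
```

With this rule a content model such as (CML:ATOMS,XDEV:OTHER) would accept an RDF:* element in second position but reject a second CML element there.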

Summary

This mechanism has been tested on documents with 3 namespaces, all linked to per-element schema files. It has greatly aided the creation of authoring tools, which can now use the full contentspec and ATTLISTs. It will be present in the next snapshot of JUMBO and comments before that time will be valued and implemented if possible.

Peter Murray-Rust

peter@ursus.demon.co.uk