Programming Guide


How to get a parse tree

The Simplest Example

import java.io.*;
import com.ibm.xml.parser.*;
....
        String filename;
        ....
        InputStream is = new FileInputStream(filename);
        TXDocument doc = new Parser(filename).readStream(is);
        is.close();

Parser#readStream() never returns null. When the parser is used this way, it prints parse errors to the standard error stream.

To access the parse tree, use TXDocument#getDocumentElement() (see How to operate a parse tree). TXDocument#getDocumentElement() may return null when the XML document has serious errors.

A Parser instance cannot be reused. An application can call the Parser#readStream() method only once.

NOTE: A TXDocument instance generated by Parser is also an instance of org.w3c.dom.Document. This is a DOM object tree.

You can write the parse tree back out to a stream in XML format.

    String charset = "ISO-8859-1";
    String jencode = MIME2Java.convert(charset);
    PrintWriter pw
        = new PrintWriter(new OutputStreamWriter(System.out, jencode));
    doc.setEncoding(charset);
    doc.print(pw, jencode);
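
The point of the snippet above is that the encoding named in the document and the encoding used by the writer must agree. The following is a minimal sketch of that idea using JDK classes only. It assumes java.nio.charset.Charset as a stand-in for MIME2Java.convert() and writes a hand-built XML string instead of a TXDocument, so the class name EncodingSketch and its write() method are illustrative and not part of this parser's API.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

class EncodingSketch {
    // Writes a tiny XML document whose declaration names the same charset
    // that the writer uses to convert characters into bytes.
    static byte[] write(String charsetName, String text) {
        // Assumption: Charset.forName() plays the role of MIME2Java.convert().
        Charset cs = Charset.forName(charsetName);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PrintWriter pw = new PrintWriter(new OutputStreamWriter(out, cs));
        pw.print("<?xml version=\"1.0\" encoding=\"" + charsetName + "\"?>");
        pw.print("<GREETING>" + text + "</GREETING>");
        pw.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] latin1 = write("ISO-8859-1", "caf\u00e9");
        byte[] utf8 = write("UTF-8", "caf\u00e9");
        // The same characters produce different byte sequences per encoding.
        System.out.println(latin1.length + " bytes vs " + utf8.length + " bytes");
    }
}
```

If the declared encoding and the writer's encoding differ, a reader that trusts the declaration will decode non-ASCII characters incorrectly.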

Set parsing options

You can configure the parser's behavior after creating a Parser instance and before calling readStream().

import java.io.*;
import com.ibm.xml.parser.*;
....
        String filename;
        ....
        Parser parse = new Parser(filename);
        parse.setWarningNoDoctypeDecl(false);
        parse.setWarningNoXMLDecl(false);
        InputStream is = new FileInputStream(filename);
        TXDocument doc = parse.readStream(is);
        is.close();

Redirect parsing errors

You can control where the parser's error output goes. Create an instance of a class that implements ErrorListener, and pass that instance to the Parser constructor.

The Object key parameter of the error() method is an instance of either String or Exception. When key is a String, it identifies the type of error (see the source com/ibm/xml/parser/r/Message.java).

See the sources com/ibm/xml/parser/trlxml.java and com/ibm/xml/parser/Stderr.java.
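
The String-or-Exception convention for key can be handled with a simple instanceof test. The sketch below is only an illustration: KeyedErrorSink is a hypothetical, simplified interface invented for this example, the real ErrorListener#error() signature may differ, and the key value "E_SAMPLE" is made up. Consult the sources named above for the actual interface.

```java
import java.util.ArrayList;
import java.util.List;

class ErrorKeySketch {
    // Hypothetical stand-in for a listener; NOT the real ErrorListener interface.
    interface KeyedErrorSink {
        void error(Object key, String message);
    }

    // Collects errors in memory instead of printing them to standard error.
    static final class CollectingSink implements KeyedErrorSink {
        final List<String> lines = new ArrayList<>();

        public void error(Object key, String message) {
            if (key instanceof Exception) {
                // The key carries the exception that caused the error.
                lines.add("exception " + key.getClass().getSimpleName() + ": " + message);
            } else if (key instanceof String) {
                // The key names the type of error.
                lines.add("error type " + key + ": " + message);
            } else {
                lines.add("unknown: " + message);
            }
        }
    }

    public static void main(String[] args) {
        CollectingSink sink = new CollectingSink();
        sink.error("E_SAMPLE", "sample message");            // hypothetical key
        sink.error(new java.io.IOException("x"), "read failed");
        for (String line : sink.lines) {
            System.out.println(line);
        }
    }
}
```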


How to operate a parse tree

A TXDocument can have one TXElement instance, zero or one DTD instance, and instances of TXPI and TXComment as children. All children of a TXDocument can be accessed with TXDocument#getChildren() / TXDocument#getChildrenArray(). The TXElement instance can also be accessed with TXDocument#getDocumentElement().

A TXElement can have instances of TXElement, TXText, TXPI, and TXComment as children. All children of a TXElement can be accessed with TXElement#getChildren() / TXElement#getChildrenArray().

Some methods of TXDocument and TXElement return one or more instances of the Child interface. Each Child instance is also an instance of TXElement, TXText, TXPI, TXComment, or (for a child of TXDocument) DTD. To find out which class an instance belongs to, use Node#getNodeType() or the instanceof operator, as in the following:

import com.ibm.xml.parser.*;
import com.ibm.dom.*;
    ....
    TXDocument doc = ....;
    TXElement root = doc.getDocumentElement();
    NodeEnumerator ne = root.getChildren().getEnumerator();
    Node ch;
    while (null != (ch = ne.getNext())) {
        if (ch instanceof TXElement) {
            TXElement el = (TXElement)ch;
            ....
        } else if (ch instanceof TXText) {
            TXText te = (TXText)ch;
            ....
        }
    }
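
The example above uses instanceof. The alternative mentioned earlier, Node#getNodeType(), can be sketched as follows. This sketch assumes the JDK's built-in DOM parser (javax.xml.parsers) and the standard org.w3c.dom interfaces rather than this parser's TX classes, and the class name NodeTypeSketch is illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class NodeTypeSketch {
    // Returns a short label for each child of the root element, in order.
    static String describeChildren(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
        StringBuilder sb = new StringBuilder();
        for (Node ch = doc.getDocumentElement().getFirstChild();
             ch != null; ch = ch.getNextSibling()) {
            // Dispatch on the node type constant instead of the Java class.
            switch (ch.getNodeType()) {
                case Node.ELEMENT_NODE:
                    sb.append("element:").append(ch.getNodeName());
                    break;
                case Node.TEXT_NODE:
                    sb.append("text");
                    break;
                case Node.COMMENT_NODE:
                    sb.append("comment");
                    break;
                case Node.PROCESSING_INSTRUCTION_NODE:
                    sb.append("pi");
                    break;
                default:
                    sb.append("other");
                    break;
            }
            sb.append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(describeChildren("<A><B/>hello<!--c--><?p x?></A>"));
    }
}
```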

White Space

The processor keeps all white space and passes it to applications, in accordance with section 2.10, White Space Handling, of the XML 1.0 Proposed Recommendation. The processor sets the IsIgnorableWhitespace flag on TXText instances that consist only of white space.

<MEMBERS>
  <PERSON>Hiroshi</PERSON>
  <PERSON>Naohiko</PERSON>
  <PERSON>
    Kent
  </PERSON>
</MEMBERS>

The processor parses this Element as follows:

TXElement (getName():"MEMBERS", getText():"\n  Hiroshi\n  Naohiko\n  \n    Kent\n  \n")
  TXText ("\n  ", ignorable)
  TXElement (getName():"PERSON", getText():"Hiroshi")
    TXText ("Hiroshi")
  TXText ("\n  ", ignorable)
  TXElement (getName():"PERSON", getText():"Naohiko")
    TXText ("Naohiko")
  TXText ("\n  ", ignorable)
  TXElement (getName():"PERSON", getText():"\n    Kent\n  ")
    TXText ("\n    Kent\n  ")
  TXText ("\n", ignorable)

It is useful to call TXText#trim(String) / TXText#trim(String,boolean,boolean) when an application does not need leading/trailing spaces.
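
The effect of preserved white space, and of trimming it, can be reproduced with a minimal sketch. It assumes the JDK's built-in DOM parser and String#trim() in place of this parser's TXText#trim() methods, and it detects white-space-only text by inspecting the text itself rather than by reading an ignorable flag. The class name WhitespaceSketch is illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class WhitespaceSketch {
    // The MEMBERS document shown above, with its line breaks and indentation.
    static final String XML =
        "<MEMBERS>\n  <PERSON>Hiroshi</PERSON>\n  <PERSON>Naohiko</PERSON>\n"
        + "  <PERSON>\n    Kent\n  </PERSON>\n</MEMBERS>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Counts the root's child text nodes that consist only of white space.
    static int countWhitespaceOnlyChildren(Document doc) {
        int count = 0;
        for (Node ch = doc.getDocumentElement().getFirstChild();
             ch != null; ch = ch.getNextSibling()) {
            if (ch.getNodeType() == Node.TEXT_NODE
                && ch.getNodeValue().trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Returns the text of each PERSON element with leading/trailing space removed.
    static List<String> trimmedNames(Document doc) {
        List<String> names = new ArrayList<>();
        NodeList persons = doc.getElementsByTagName("PERSON");
        for (int i = 0; i < persons.getLength(); i++) {
            names.add(persons.item(i).getTextContent().trim());
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        System.out.println(countWhitespaceOnlyChildren(doc));
        System.out.println(trimmedNames(doc));
    }
}
```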


How to get a filtered parse tree

class AElementHandler implements ElementHandler {
    public TXElement handleElement(TXElement el) {
        ....
    }
}

    ....
    Parser parse = new Parser(...);
    parse.addElementHandler(new AElementHandler(), "CHANNEL");
    TXDocument doc = parse.readStream(is);

During Parser#readStream(), ElementHandler#handleElement() is called after the parser reads each matching end tag (</CHANNEL>) and before the element is added to its parent. The parser adds the TXElement instance returned by handleElement() to the parent. If handleElement() returns null, the parser adds nothing to the parent.

There are two ways to register an ElementHandler:

  1. Parser#addElementHandler(ElementHandler) registers a handler for all elements.
  2. Parser#addElementHandler(ElementHandler, String) registers a handler for elements with the given name.

Order of Calling ElementHandlers

When more than one ElementHandler is registered with the parser, the parser first calls the ElementHandlers registered for a specific element name (first set, first called) and then calls the ElementHandlers registered for all elements.

Even if an ElementHandler changes the name of a TXElement, the parser still calls the other ElementHandlers registered for the original name. When an ElementHandler returns null, the parser does not call any further ElementHandlers.

    Parser parse = new Parser(...);
    parse.addElementHandler(handler1);
    parse.addElementHandler(handler2, "CHANNEL");
    parse.addElementHandler(handler3, "CHANNEL");
    parse.addElementHandler(handler4);
    TXDocument doc = parse.readStream(is);

In this case, when the parser processes the </CHANNEL> tag, it calls handler2 first, then handler3, handler1, and handler4, in that order.
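
The ordering rule can be modeled in a few lines. The sketch below is not the parser's implementation; HandlerOrderSketch and its Entry class are hypothetical names that only reproduce the rule as described: name-specific handlers first, then handlers for all elements, each group in registration order.

```java
import java.util.ArrayList;
import java.util.List;

class HandlerOrderSketch {
    // Hypothetical record of one registration: a label plus the element
    // name it was registered for (null means "all elements").
    static final class Entry {
        final String label;
        final String elementName;

        Entry(String label, String elementName) {
            this.label = label;
            this.elementName = elementName;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String label) {
        entries.add(new Entry(label, null));
    }

    void add(String label, String elementName) {
        entries.add(new Entry(label, elementName));
    }

    // Returns handler labels in the order they would be called for elementName.
    List<String> callOrder(String elementName) {
        List<String> order = new ArrayList<>();
        for (Entry e : entries) {            // name-specific handlers first
            if (elementName.equals(e.elementName)) {
                order.add(e.label);
            }
        }
        for (Entry e : entries) {            // then handlers for all elements
            if (e.elementName == null) {
                order.add(e.label);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        HandlerOrderSketch reg = new HandlerOrderSketch();
        reg.add("handler1");
        reg.add("handler2", "CHANNEL");
        reg.add("handler3", "CHANNEL");
        reg.add("handler4");
        System.out.println(reg.callOrder("CHANNEL")); // [handler2, handler3, handler1, handler4]
    }
}
```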


How to make a new XML document

  1. Make a TXDocument instance
    TXDocument doc = new TXDocument();
  2. Construct a tree.
    doc.addElement(...);
  3. Prepare a PrintWriter.
  4. Set the encoding on the TXDocument if the encoding of the PrintWriter is not UTF-8.
  5. Output.
    doc.print(...);

    TXDocument doc = new TXDocument();
    TXElement el = new TXElement("CHANNEL");
    ....
    doc.addElement(el);
    PrintWriter pw
        = new PrintWriter(new OutputStreamWriter(System.out,
                                                 MIME2Java.convert("Shift_JIS")));
    doc.setEncoding("Shift_JIS");
    doc.print(pw);

How to replace classes

To make the parser create instances of a subclass of TXElement instead of TXElement itself, implement the ElementFactory interface and call Parser#setElementFactory().

  1. Design a subclass of the TXElement class
  2. Design a subclass of the DefaultElementFactory class.
  3. Call Parser#setElementFactory() with an instance of the class implementing ElementFactory.

class MyElement extends TXElement {
    ....
}
class MyElementFactory extends DefaultElementFactory {
    ....
}

    ....
    Parser parse = new Parser(...);
    parse.setElementFactory(new MyElementFactory());
    TXDocument doc = parse.readStream(is);
    // doc now contains MyElement instances instead of TXElement instances

NOTE: ElementFactory#createElement() is called when the processor reaches a start-tag. ElementFactory#ripenElement() is called when the processor reaches an end-tag.


How to query DTD information

Load DTD without loading document

    String systemlit = "http://.../foobar.dtd";
    InputStream is = (new URL(systemlit)).openStream();
    Parser parse = new Parser(...);
    DTD dtd = parse.readDTDStream(is);

What attributes can be set in the element "FOO"?

    Enumeration en = dtd.getAttributeDeclarations("FOO");
    while (en.hasMoreElements()) {
        AttDef attd = (AttDef)en.nextElement();
        // attd.getName() is attribute name
    }

What values can an attribute have?

First, get an AttDef instance by the above method or by DTD#getAttributeDeclaration(String,String).

Next, check the attribute type by means of AttDef#getType(), which returns one of the following values:

TXAttribute.T_CDATA
Any text value.
TXAttribute.T_ENTITIES
A subset of the unparsed entity names. To specify more than one value, separate the names with white space (" "): for example, "name1 name2 name3".
    Enumeration en = dtd.getEntities();
    while (en.hasMoreElements()) {
        EntityValue ev = (EntityValue)en.nextElement();
        if (ev.isNDATA()) {
            // Each ev.getName() is a valid value.
        }
    }
TXAttribute.T_ENTITY
One of the unparsed entity names (see above).
TXAttribute.T_ENUMERATION
One of the values returned by AttDef#elements().
    Enumeration en = attd.elements();
    while (en.hasMoreElements()) {
        String s = (String)en.nextElement();
        // Each s is valid.
    }
TXAttribute.T_ID
Any name for which DTD#checkID() returns null.
    String newid = ...
    if (null != dtd.checkID(newid)) {
        // Can't use newid
    } else {
        dtd.registID(element, newid);
    }
TXAttribute.T_IDREF
One of the registered IDs.
    Enumeration en = dtd.IDs();
    while (en.hasMoreElements()) {
        String id = (String)en.nextElement();
        // Each id is a valid value for the attribute.
    }
TXAttribute.T_IDREFS
A subset of the registered IDs. To specify more than one value, separate the IDs with white space (" ").
TXAttribute.T_NMTOKEN
One Nmtoken.
TXAttribute.T_NMTOKENS
A set of Nmtokens. To specify more than one value, separate the Nmtokens with white space (" ").
TXAttribute.T_NOTATION
One of the values returned by AttDef#elements().
    Enumeration en = attd.elements();
    while (en.hasMoreElements()) {
        String s = (String)en.nextElement();
        // Each s is valid.
    }

What elements can be inserted into an element "FOO" as a child?

<!ELEMENT PERSON (NAME, HEIGHT, WEIGHT, EMAIL?)>

With this declaration, you must insert the "NAME" element into the "PERSON" element first, the "HEIGHT" element second, and the "WEIGHT" element third. You can then optionally insert the "EMAIL" element.

Applications can query these rules with DTD#getInsertableElements() / DTD#getAppendableElements().
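
The ordering rule for this one declaration can be sketched independently of the DTD class. The code below is a hand-written model of the sequence (NAME, HEIGHT, WEIGHT, EMAIL?) only. It assumes a plain sequence of distinct names with optional items, it is not how the DTD class computes its answer, and the class name ContentModelSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ContentModelSketch {
    // Model of <!ELEMENT PERSON (NAME, HEIGHT, WEIGHT, EMAIL?)>.
    static final String[] NAMES = {"NAME", "HEIGHT", "WEIGHT", "EMAIL"};
    static final boolean[] OPTIONAL = {false, false, false, true};

    // Returns the element names that may be appended after the given
    // children, or null if the children already violate the model.
    static List<String> appendable(List<String> children) {
        int pos = 0; // index of the next model item not yet consumed
        for (String child : children) {
            // Skip optional items that were omitted.
            while (pos < NAMES.length && !NAMES[pos].equals(child) && OPTIONAL[pos]) {
                pos++;
            }
            if (pos == NAMES.length || !NAMES[pos].equals(child)) {
                return null; // child is out of order or not allowed
            }
            pos++;
        }
        List<String> result = new ArrayList<>();
        for (int i = pos; i < NAMES.length; i++) {
            result.add(NAMES[i]);
            if (!OPTIONAL[i]) {
                break; // a required item must come before anything after it
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(appendable(Arrays.asList("NAME")));                     // [HEIGHT]
        System.out.println(appendable(Arrays.asList("NAME", "HEIGHT", "WEIGHT"))); // [EMAIL]
        System.out.println(appendable(Arrays.asList("HEIGHT")));                   // null
    }
}
```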

    TXElement el = new TXElement("PERSON");
    ....
    switch (dtd.getContentType("PERSON")) {
      case 0:
        // This element is not declared.
        break;
      case DTD.CM_EMPTY:
        // No element is insertable.
        break;
      case DTD.CM_ANY:
        // Any element is insertable.
        break;
      case DTD.CM_REGULAR:
        Hashtable tab = dtd.prepareTable("PERSON");
        // This hashtable is reusable for any element.
        dtd.getAppendableElements(el, tab);
        if (((InsertableElement)tab.get(DTD.CM_ERROR)).status) {
            // This element has an incorrect structure.
        } else {
            Enumeration en = tab.elements();
            while (en.hasMoreElements()) {
                InsertableElement ie = (InsertableElement)en.nextElement();
                if (!ie.name.equals(DTD.CM_ERROR)
                    && !ie.name.equals(DTD.CM_EOC)
                    && ie.status) {
                    if (ie.name.equals(DTD.CM_PCDATA)) {
                        // Can append a TXText instance to el.
                    } else {
                        // Can append a TXElement instance named ie.name.
                    }
                }
            }
        }
        break;
    }

Namespace

The Namespace specification is still in progress, so this implementation is experimental.


Last modified: Fri Feb 06 15:49:02 JST 1998