Simple API for XML (SAX): Draft Specification (1998-01-12)

Please note that this is a draft, and may be subject to change.

SAX -- the Simple API for XML -- is a simple, common event-based interface for XML parsers written in object-oriented languages. This document presents a draft specification for the interface, with examples in Java (to be replaced with IDL in a future draft).

Goals and Intended Users

The first version of SAX is designed for the finite set of XML applications that require access only to the logical structure of XML documents. These cover a very wide range of applications, including most browsers, formatters, production systems, database tools, search engines, online transaction processors, and meta-data exchange.

There are, however, some important XML applications -- most notably authoring tools and document repositories -- that require access to purely lexical information such as comments and the boundaries of CDATA sections, character references, and internal entity references. SAX has been designed with an open architecture so that, if desired, future versions may add additional types of handlers for this sort of information, but the special needs of these applications are not supported by the current version of the API, since their required information set can be extremely large.

Components

SAX consists of four core interfaces, one for the parser (Parser) and three for user-supplied event handlers (EntityHandler, DocumentHandler, and ErrorHandler).

In addition to the core interfaces, SAX implementations contain a convenience base class for deriving handlers (HandlerBase) and, in languages that support exceptions, an exception class (XmlException).

Parser Interface

Every SAX-conformant XML parser (or front-end driver) must implement the following methods:

public void setEntityHandler (EntityHandler handler)
Register an object to receive callbacks related to a document's entity structure (see EntityHandler). You must register the handler before the parse begins. If no handler is registered, the parser will perform default actions specified under the description of EntityHandler.
public void setDocumentHandler (DocumentHandler handler)
Register an object to receive callbacks related to a document's logical structure (see DocumentHandler). You must register the handler before the parse begins. If no handler is registered, the parser will perform default actions specified under the description of DocumentHandler.
public void setErrorHandler (ErrorHandler handler)
Register an object to receive callbacks for errors and warnings (see ErrorHandler). You must register the handler before the parse begins. If no handler is registered, the parser will perform default actions specified under the description of ErrorHandler.
public void parse (String publicID, String systemID) throws Exception
Begin parsing an XML document with the specified public and system identifiers (the system identifier, a URI, is required). This method will not return until the document is completely parsed, or until parsing is halted by an exception or other abnormal condition. Note that in languages that support exceptions, the parse method may throw any exception at all, though except for I/O-related exceptions, the exception will originate in your own handler code rather than in the parser.

In Java, the parser implements an interface named org.xml.sax.Parser; in languages that do not support interfaces, it may extend an abstract base class.

EntityHandler Interface

While SAX concentrates on logical structure, there are two areas where a document's physical structure affects general processing:

  1. the resolution of external identifiers in the XML document itself; and
  2. the resolution of relative URIs in content or attribute values.

In the first case, a user might want to substitute a different URI than the default provided in an XML document, possibly by looking up the public identifier in a table. In the second case, a user might want to resolve a relative URI against the URI of the current external entity. The EntityHandler interface provides the following methods:

public String resolveEntity (String entityName, String publicID, String systemID) throws Exception
Given the system identifier (URI) systemID, possibly accompanied by an entity name (entityName, with the special values "[document]" for the document entity and "[external dtd]" for the external DTD subset) and/or public identifier (publicID), return the system identifier that the parser should use to obtain the entity, or null, to instruct the parser to skip the entity (in which case the parser may report a validation error). In most cases, this method should return the suggested system identifier.
public void changeEntity (String systemID) throws Exception
Handle a change in the current entity URI. The systemID argument specifies the base URI that is now in force.
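The table lookup mentioned above could be sketched roughly as follows. The EntityHandler interface is reproduced locally from this draft so that the example compiles on its own; the CatalogEntityHandler class, its addMapping method, and the identifiers used with it are hypothetical and not part of SAX.

```java
import java.util.HashMap;
import java.util.Map;

// Local copy of the draft EntityHandler interface, so the sketch stands alone.
interface EntityHandler {
  String resolveEntity(String entityName, String publicID, String systemID) throws Exception;
  void changeEntity(String systemID) throws Exception;
}

// Hypothetical handler: substitutes a local URI when the public identifier is known.
class CatalogEntityHandler implements EntityHandler {
  private final Map<String, String> catalog = new HashMap<>();

  // Register a replacement system identifier for a public identifier.
  void addMapping(String publicID, String replacementSystemID) {
    catalog.put(publicID, replacementSystemID);
  }

  public String resolveEntity(String entityName, String publicID, String systemID) {
    if (publicID != null && catalog.containsKey(publicID)) {
      return catalog.get(publicID);
    }
    // No table entry: fall back to the suggested system identifier.
    return systemID;
  }

  public void changeEntity(String systemID) {
    // Base URI changes are not needed for the lookup, so this is a no-op.
  }
}
```

Returning the suggested system identifier when no entry matches keeps the handler consistent with the default behaviour for every entity that the table does not cover.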

EntityHandler: Default Behaviour

If the user does not register an EntityHandler, the parser will behave as if the handlers were implemented as follows:

public String resolveEntity (String entityName,
                             String publicID, String systemID)
{
  return systemID;
}

public void changeEntity (String systemID) {}
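An application that needs to resolve relative URIs would replace the empty default changeEntity with one that remembers the current base. The following is a minimal sketch under that assumption; BaseTrackingHandler and its resolve method are hypothetical names, and java.net.URI is used here only as one convenient way to do the resolution.

```java
import java.net.URI;

// Local copy of the draft EntityHandler interface, so the sketch stands alone.
interface EntityHandler {
  String resolveEntity(String entityName, String publicID, String systemID) throws Exception;
  void changeEntity(String systemID) throws Exception;
}

// Hypothetical handler: remembers the base URI that the parser reports.
class BaseTrackingHandler implements EntityHandler {
  private URI base;  // null until the parser reports an entity

  public String resolveEntity(String entityName, String publicID, String systemID) {
    return systemID;  // same as the default behaviour
  }

  public void changeEntity(String systemID) {
    // URI.create throws IllegalArgumentException for a malformed identifier.
    base = (systemID == null) ? null : URI.create(systemID);
  }

  // Resolve a URI found in content or an attribute value against the current base.
  String resolve(String reference) {
    return (base == null) ? reference : base.resolve(reference).toString();
  }
}
```

Application code (for example, a startElement implementation that reads a link attribute) could then call resolve on the same object that was registered with the parser.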

DocumentHandler Interface

The DocumentHandler interface provides most of the basic functionality of SAX. The parser will inform this interface of basic XML structural events, such as character data and the start and end of elements:

public void startDocument () throws Exception
Receive an event signalling the beginning of the document. This will always be the first callback method invoked, so it is a good place for allocating or initialising structures.
public void endDocument () throws Exception
Receive an event signalling the end of the document. This will always be the last callback method invoked, so it is a good place for finalising or deallocating structures.
public void doctype (String name, String publicID, String systemID) throws Exception
Receive an event signalling a document type declaration. The name argument provides the document type name (also the name of the root element), the publicID argument provides the public identifier for the external DTD subset (or null if none is present), and the systemID argument provides the URI for the external DTD subset (or null if none is present).
public void startElement (String name, AttributeMap attributes) throws Exception
Receive an event signalling the start of an element. The name argument provides the element type name, and the attributes argument provides access to the element's attributes, if any (see the AttributeMap interface). Note that the attributes argument is volatile, and will provide correct results only during the invocation of the startElement method.
public void endElement (String name) throws Exception
Receive an event signalling the end of an element. The name argument provides the element type name.
public void characters (char ch[], int start, int length) throws Exception
Receive an event signalling that character data has been found. The ch argument is an array containing the characters, the start argument provides the starting offset in the array, and the length argument provides the number of characters to read. Note that the ch argument is volatile, and will provide correct results only during the invocation of the characters method -- if you need to use the characters elsewhere, you must copy them.
public void ignorable (char ch[], int start, int length) throws Exception
Receive an event signalling that ignorable whitespace has been found: this event can be generated only if the document has a DTD and the parser is DTD-aware (otherwise, the whitespace will be reported using the regular characters callback). The ch argument is an array containing the whitespace characters, the start argument provides the starting offset in the array, and the length argument provides the number of characters to read. Note that the ch argument is volatile, and will provide correct results only during the invocation of the ignorable method -- if you need to use the whitespace characters elsewhere, you must copy them.
public void processingInstruction (String target, String remainder) throws Exception
Receive an event signalling that a processing instruction has been found. The target argument gives the name, which is the first (required) part of the processing instruction, and the remainder argument gives the rest of the instruction, excluding leading whitespace, or the empty string if there is nothing else.
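As an illustration of the callbacks above, here is a small sketch of a handler that records element names and accumulates character data. The DocumentHandler interface is copied locally from this draft; AttributeMap is cut down to a single method just so the example compiles; TextCollector and its accessor methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Abbreviated local stand-in: the draft's AttributeMap declares more methods.
interface AttributeMap {
  String getValue(String attributeName);
}

// Local copy of the draft DocumentHandler interface.
interface DocumentHandler {
  void startDocument() throws Exception;
  void endDocument() throws Exception;
  void doctype(String name, String publicID, String systemID) throws Exception;
  void startElement(String name, AttributeMap attributes) throws Exception;
  void endElement(String name) throws Exception;
  void characters(char ch[], int start, int length) throws Exception;
  void ignorable(char ch[], int start, int length) throws Exception;
  void processingInstruction(String target, String remainder) throws Exception;
}

// Hypothetical handler: records element names and accumulates character data.
class TextCollector implements DocumentHandler {
  private final List<String> elementNames = new ArrayList<>();
  private final StringBuilder text = new StringBuilder();
  private boolean finished;

  public void startDocument() {
    // First callback: reset state so the handler can be reused.
    elementNames.clear();
    text.setLength(0);
    finished = false;
  }
  public void endDocument() { finished = true; }
  public void doctype(String name, String publicID, String systemID) {}
  public void startElement(String name, AttributeMap attributes) {
    elementNames.add(name);
  }
  public void endElement(String name) {}
  public void characters(char ch[], int start, int length) {
    // ch is volatile: append() copies the requested slice immediately.
    text.append(ch, start, length);
  }
  public void ignorable(char ch[], int start, int length) {}
  public void processingInstruction(String target, String remainder) {}

  List<String> getElementNames() { return elementNames; }
  String getText() { return text.toString(); }
  boolean isFinished() { return finished; }
}
```

Because the characters method copies its slice straight away, the collected text stays valid even if the parser later reuses the array.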

DocumentHandler: Default Behaviour

If the user does not register a DocumentHandler, the parser will behave as if the handlers were implemented as follows:

public void startDocument () {}
public void endDocument () {}
public void doctype (String name, String publicID, String systemID) {}
public void startElement (String name, AttributeMap attributes) {}
public void endElement (String name) {}
public void characters (char ch[], int start, int length) {}
public void ignorable (char ch[], int start, int length) {}
public void processingInstruction (String target, String remainder) {}

ErrorHandler Interface

This interface gives you a chance to implement your own error handling routines. Upon encountering a fatal error, the behaviour of parsers (after calling the fatal handler) is unspecified: some may attempt to continue parsing normally, some may report errors, and some may stop parsing altogether.

public void warning (String message, String systemID, int line, int column) throws Exception
Report a caution or an error that is not serious enough to invalidate the parse. The results of parsing the document may still be usable, but you may stop anyway by using some sort of non-local goto (such as an exception). The message parameter contains a string describing the problem; the systemID parameter contains the URI of the entity that caused the warning, or null if none is available or applicable; the line parameter contains the line number in the relevant entity, or -1 if none is available or applicable; and the column parameter contains the offset in the current line, or -1 if none is available or applicable.
public void fatal (String message, String systemID, int line, int column) throws Exception
Report an error that is serious enough to invalidate the parse. The results of parsing the document will not be usable, and you may continue only for the purpose of collecting more errors. The message parameter contains a string describing the problem; the systemID parameter contains the URI of the entity that caused the error, or null if none is available or applicable; the line parameter contains the line number in the relevant entity, or -1 if none is available or applicable; and the column parameter contains the offset in the current line, or -1 if none is available or applicable.
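One possible use of these two callbacks is to collect reports rather than stop at the first one. The sketch below assumes that approach; the ErrorHandler interface is copied locally from this draft, while CollectingErrorHandler, its format helper, and the message layout are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Local copy of the draft ErrorHandler interface.
interface ErrorHandler {
  void warning(String message, String systemID, int line, int column) throws Exception;
  void fatal(String message, String systemID, int line, int column) throws Exception;
}

// Hypothetical handler: keeps every report and remembers whether a fatal error occurred.
class CollectingErrorHandler implements ErrorHandler {
  private final List<String> reports = new ArrayList<>();
  private boolean fatalSeen;

  // Build a readable message, leaving out any location part that is unavailable.
  static String format(String severity, String message, String systemID, int line, int column) {
    StringBuilder sb = new StringBuilder(severity).append(": ").append(message);
    if (systemID != null) sb.append(" in ").append(systemID);
    if (line != -1) sb.append(", line ").append(line);
    if (column != -1) sb.append(", column ").append(column);
    return sb.toString();
  }

  public void warning(String message, String systemID, int line, int column) {
    reports.add(format("warning", message, systemID, line, column));
  }

  public void fatal(String message, String systemID, int line, int column) {
    // Continue only to gather more errors; the parse results are not usable.
    fatalSeen = true;
    reports.add(format("fatal", message, systemID, line, column));
  }

  List<String> getReports() { return reports; }
  boolean hasFatal() { return fatalSeen; }
}
```

After parse returns, the caller would check hasFatal before trusting anything that the other handlers collected.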

ErrorHandler: Default Behaviour

If the user does not register an ErrorHandler, the parser will print a message to the standard error stream for each warning. For a fatal error, the parser will throw an exception of type XmlException in languages that support exceptions, or will invoke some other sort of non-local goto in other languages.

AttributeMap Interface

This interface represents a map of attributes for a single element. It allows you to retrieve each attribute's value, to check each attribute for special characteristics (whether it is an entity, notation, ID, or IDREF), and to look up related information if the attribute value is an entity or notation name (this applies only to documents with DTDs parsed with a DTD-aware parser).

public Enumeration getAttributeNames ()
Return an enumeration of attribute names. For languages without a standard Enumeration data type, return an array or list of attribute names.
public String getValue (String attributeName)
Return the attribute's value as a string.
public boolean isEntity (String attributeName)
Return true if the attribute's value is actually the name of an NDATA entity (always false for documents without DTDs).
public boolean isNotation (String attributeName)
Return true if the attribute's value is actually the name of a notation (always false for documents without DTDs).
public boolean isId (String attributeName)
Return true if the attribute's value is a unique identifier for the element (always false for documents without DTDs).
public boolean isIdref (String attributeName)
Return true if the attribute's value is a pointer to another element (always false for documents without DTDs).
public String getEntityPublicID (String attributeName)
If the attribute value is an NDATA entity name and the entity has a public identifier, return the public identifier as a string. If the entity has no public identifier, or if the attribute value is not an entity name, return null.
public String getEntitySystemID (String attributeName)
If the attribute value is an NDATA entity name, return the system identifier as a string; otherwise, return null.
public String getNotationNameID (String attributeName)
If the attribute value is an NDATA entity name, return the name of the associated notation; otherwise, return null.
public String getNotationPublicID (String attributeName)
If the attribute value is an NDATA entity name, this method applies to the associated notation; if the attribute value is a notation name, this method applies to the notation named in the attribute value. Return the notation's public identifier, or null if none is available.
public String getNotationSystemID (String attributeName)
If the attribute value is an NDATA entity name, this method applies to the associated notation; if the attribute value is a notation name, this method applies to the notation named in the attribute value. Return the notation's system identifier, or null if none is available.
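Since the attributes argument to startElement is volatile, a handler that needs attribute values later has to copy them during the callback. The following sketch shows one way to do that with getAttributeNames and getValue. The AttributeMap interface is copied locally from this draft (with a generic type parameter added to Enumeration, which the draft does not have); SimpleAttributeMap and AttributeSnapshot are hypothetical helpers that exist only to make the example self-contained.

```java
import java.util.Collections;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;

// Local copy of the draft AttributeMap interface (the draft uses a raw Enumeration).
interface AttributeMap {
  Enumeration<String> getAttributeNames();
  String getValue(String attributeName);
  boolean isEntity(String attributeName);
  boolean isNotation(String attributeName);
  boolean isId(String attributeName);
  boolean isIdref(String attributeName);
  String getEntityPublicID(String attributeName);
  String getEntitySystemID(String attributeName);
  String getNotationNameID(String attributeName);
  String getNotationPublicID(String attributeName);
  String getNotationSystemID(String attributeName);
}

// Hypothetical in-memory map that behaves as for a document without a DTD.
class SimpleAttributeMap implements AttributeMap {
  private final Map<String, String> values = new LinkedHashMap<>();
  void put(String name, String value) { values.put(name, value); }
  public Enumeration<String> getAttributeNames() { return Collections.enumeration(values.keySet()); }
  public String getValue(String attributeName) { return values.get(attributeName); }
  public boolean isEntity(String attributeName) { return false; }
  public boolean isNotation(String attributeName) { return false; }
  public boolean isId(String attributeName) { return false; }
  public boolean isIdref(String attributeName) { return false; }
  public String getEntityPublicID(String attributeName) { return null; }
  public String getEntitySystemID(String attributeName) { return null; }
  public String getNotationNameID(String attributeName) { return null; }
  public String getNotationPublicID(String attributeName) { return null; }
  public String getNotationSystemID(String attributeName) { return null; }
}

// Hypothetical helper: copies names and values so they outlive the callback.
class AttributeSnapshot {
  static Map<String, String> copy(AttributeMap attributes) {
    Map<String, String> copy = new LinkedHashMap<>();
    Enumeration<String> names = attributes.getAttributeNames();
    while (names.hasMoreElements()) {
      String name = names.nextElement();
      copy.put(name, attributes.getValue(name));
    }
    return copy;
  }
}
```

A startElement implementation could call AttributeSnapshot.copy(attributes) and keep the returned map for as long as it needs.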

XmlException Class

XmlException is an exception especially designed for reporting XML errors (in languages that support exceptions). The exception encapsulates all of the information provided to handlers in the ErrorHandler interface:

public XmlException (String message, String systemID, int line, int column)
Construct a new instance of an XmlException. For the significance of the arguments, see below.
public String getMessage()
Return a message describing the reason for the exception.
public String getSystemID()
Return the system identifier (URI) of the entity that caused the exception, or null if none is available (or relevant).
public int getLine()
For the problem entity, return the number of the line that caused the exception, or -1 if none is available (or relevant).
public int getColumn()
For the problem line in the problem entity, return the column position that caused the exception, or -1 if none is available (or relevant).
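In Java, the class described above might look roughly like the sketch below. Extending java.lang.Exception is an assumption (the draft does not name a superclass), and getMessage is simply inherited rather than redefined.

```java
// Sketch of the exception class; extending java.lang.Exception is an assumption.
class XmlException extends Exception {
  private final String systemID;
  private final int line;
  private final int column;

  public XmlException(String message, String systemID, int line, int column) {
    super(message);  // getMessage() is inherited from Throwable
    this.systemID = systemID;
    this.line = line;
    this.column = column;
  }

  // URI of the entity that caused the exception, or null if unavailable.
  public String getSystemID() { return systemID; }

  // Line number in the problem entity, or -1 if unavailable.
  public int getLine() { return line; }

  // Column position in the problem line, or -1 if unavailable.
  public int getColumn() { return column; }
}
```

Code that calls parse could catch XmlException separately from other exceptions when it wants to report the location of the problem.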

HandlerBase Class

HandlerBase is a convenience base class that provides default implementations for the EntityHandler, DocumentHandler, and ErrorHandler interfaces, as specified in the Default Behaviour section for each interface. A user can simply extend this class and override the default behaviour where necessary.
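The following sketch shows how such a base class and a derived handler might fit together. All interfaces are local copies from this draft, with AttributeMap cut down to one method for brevity; the HandlerBase body follows the default behaviours described earlier, the wording of the warning message is an assumption, and ElementCounter is a hypothetical subclass.

```java
// Abbreviated local stand-in: the draft's AttributeMap declares more methods.
interface AttributeMap { String getValue(String attributeName); }

interface EntityHandler {
  String resolveEntity(String entityName, String publicID, String systemID) throws Exception;
  void changeEntity(String systemID) throws Exception;
}

interface DocumentHandler {
  void startDocument() throws Exception;
  void endDocument() throws Exception;
  void doctype(String name, String publicID, String systemID) throws Exception;
  void startElement(String name, AttributeMap attributes) throws Exception;
  void endElement(String name) throws Exception;
  void characters(char ch[], int start, int length) throws Exception;
  void ignorable(char ch[], int start, int length) throws Exception;
  void processingInstruction(String target, String remainder) throws Exception;
}

interface ErrorHandler {
  void warning(String message, String systemID, int line, int column) throws Exception;
  void fatal(String message, String systemID, int line, int column) throws Exception;
}

// Compact version of the exception class; extending Exception is an assumption.
class XmlException extends Exception {
  private final String systemID;
  private final int line, column;
  XmlException(String message, String systemID, int line, int column) {
    super(message);
    this.systemID = systemID; this.line = line; this.column = column;
  }
  String getSystemID() { return systemID; }
  int getLine() { return line; }
  int getColumn() { return column; }
}

// Default implementations, following the Default Behaviour sections.
class HandlerBase implements EntityHandler, DocumentHandler, ErrorHandler {
  public String resolveEntity(String entityName, String publicID, String systemID)
      throws Exception { return systemID; }
  public void changeEntity(String systemID) throws Exception {}
  public void startDocument() throws Exception {}
  public void endDocument() throws Exception {}
  public void doctype(String name, String publicID, String systemID) throws Exception {}
  public void startElement(String name, AttributeMap attributes) throws Exception {}
  public void endElement(String name) throws Exception {}
  public void characters(char ch[], int start, int length) throws Exception {}
  public void ignorable(char ch[], int start, int length) throws Exception {}
  public void processingInstruction(String target, String remainder) throws Exception {}
  public void warning(String message, String systemID, int line, int column) throws Exception {
    System.err.println("warning: " + message);  // message format is an assumption
  }
  public void fatal(String message, String systemID, int line, int column) throws Exception {
    throw new XmlException(message, systemID, line, column);
  }
}

// Hypothetical subclass: overrides only the callback it cares about.
class ElementCounter extends HandlerBase {
  private int count;
  @Override public void startElement(String name, AttributeMap attributes) { count++; }
  int getCount() { return count; }
}
```

The same ElementCounter object could be registered with the parser as entity, document, and error handler, since it inherits all three interfaces from HandlerBase.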


David Megginson, Microstar Software Ltd. < dmeggins@microstar.com >