5. xmlarch: Architectural Forms

The xmlarch module contains an XML architectural forms processor written in Python. It lets you process XML architectural forms with any parser that uses the SAX interfaces, and it can process several architectures in one parsing pass. The architectural document events for an architecture can even be broadcast to multiple DocumentHandler instances; for example, you can have two handlers for the RDF architecture, three for the XLink architecture and one for the HyTime architecture.
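The broadcasting idea can be pictured as a registry that maps an architecture name to a list of handlers. The block below is only a sketch of that idea and not the xmlarch implementation: Broadcaster, Recorder and their method names are hypothetical, and the handlers implement just two of the DocumentHandler methods.

```python
class Recorder:
    """Hypothetical handler that records the events it receives."""

    def __init__(self):
        self.events = []

    def startElement(self, name, atts):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


class Broadcaster:
    """Hypothetical registry: architecture name -> list of handlers."""

    def __init__(self):
        self.handlers = {}

    def add_handler(self, arch_name, handler):
        # Several handlers may be registered for the same architecture.
        self.handlers.setdefault(arch_name, []).append(handler)

    def start_element(self, arch_name, name, atts):
        # Every handler registered for the architecture sees the event.
        for handler in self.handlers.get(arch_name, []):
            handler.startElement(name, atts)

    def end_element(self, arch_name, name):
        for handler in self.handlers.get(arch_name, []):
            handler.endElement(name)


rdf_one, rdf_two, xlink = Recorder(), Recorder(), Recorder()
broadcaster = Broadcaster()
broadcaster.add_handler("rdf", rdf_one)
broadcaster.add_handler("rdf", rdf_two)
broadcaster.add_handler("xlink", xlink)

# One event in the "rdf" architecture reaches both rdf handlers only.
broadcaster.start_element("rdf", "Description", {})
broadcaster.end_element("rdf", "Description")
print(rdf_one.events, rdf_two.events, xlink.events)
```

In xmlarch itself the registration is done with addArchDocumentHandler, as the examples in this section show.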

The architecture processor uses the SAX DocumentHandler interface, which means that you can register the architecture handler (an instance of the ArchDocHandler class) with any SAX 1.0 compliant parser. It does not currently process meta document type definitions (meta-DTDs). When a DTD parser module becomes available, the code will be modified to use it to process meta-DTD information. Please note that validating and non-validating parsers may report different SAX events when parsing the same document.
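To make the interface concrete, here is a minimal sketch of a handler that exposes the SAX 1.0 DocumentHandler methods. ElementCounter is a hypothetical name, and it is written as a plain class so that the block runs without the SAX packages; in real use you would probably derive it from saxlib.DocumentHandler. The events are fed in by hand here instead of coming from a parser.

```python
class ElementCounter:
    """Hypothetical handler with the SAX 1.0 DocumentHandler methods."""

    def __init__(self):
        self.counts = {}       # element name -> number of occurrences
        self.text_length = 0   # total number of characters seen

    def setDocumentLocator(self, locator):
        pass

    def startDocument(self):
        # Reset the statistics at the start of every document.
        self.counts = {}
        self.text_length = 0

    def endDocument(self):
        pass

    def startElement(self, name, atts):
        self.counts[name] = self.counts.get(name, 0) + 1

    def endElement(self, name):
        pass

    def characters(self, ch, start, length):
        # SAX 1.0 passes a buffer plus the offset and length of the data.
        self.text_length += length

    def ignorableWhitespace(self, ch, start, length):
        pass

    def processingInstruction(self, target, data):
        pass


# Drive the handler by hand, the way a parser or processor would.
counter = ElementCounter()
counter.startDocument()
counter.startElement("html", {})
counter.startElement("h1", {})
counter.characters("Title", 0, 5)
counter.endElement("h1")
counter.startElement("p", {})
counter.characters("Some text", 0, 9)
counter.endElement("p")
counter.endElement("html")
counter.endDocument()
print(counter.counts, counter.text_length)
```

A handler of this shape is what you would pass to addArchDocumentHandler to receive the events of one architecture.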

The xmlarch module contains six classes: ArchDocHandler, Architecture, ArchParseState, ArchException, AttributeParser and Normalizer.

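Using the xmlarch module usually involves the following steps: create an architecture processor handler (an ArchDocHandler instance), create a SAX parser and register the handler with it, add a document handler for each architecture you want to process, and then parse the document.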

A simple example

Python code:

# Import needed modules
from xml.sax import saxexts, saxlib, saxutils
import sys, xmlarch

# Create architecture processor handler
arch_handler = xmlarch.ArchDocHandler()

# Create parser and register architecture processor with it
parser = saxexts.XMLParserFactory.make_parser()
parser.setDocumentHandler(arch_handler)

# Add a document handler to process the html architecture
arch_handler.addArchDocumentHandler("html", xmlarch.Normalizer(sys.stdout))

# Parse (and process) the document
parser.parse("simple.xml")

A sample XML document:

<?xml version="1.0"?>
<?IS10744:arch name="html"?>
<doc>
<title html="h1">My first architectual document</title>
<author html="address">Geir Ove Gronmo, grove@infotek.no</author>
<para>This is the first paragraph in this document</para>
<para html="p">This is the second paragraph</para>
</doc>

The result:

<html>
<h1>My first architectual document</h1>
<address>Geir Ove Gronmo, grove@infotek.no</address>

<p>This is the second paragraph</p>
</html>
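The mapping visible in this sample can be imitated with a short sketch. The block below is not how xmlarch is implemented and does not use it; it relies on the standard library's xml.sax (a SAX 2 style API, so characters takes a single argument), and SimpleArchFilter and process are hypothetical names. It assumes only what the sample shows: the document element becomes an element named after the architecture, an element carrying an attribute named after the architecture is renamed to that attribute's value, and any other element is dropped together with its text. Attributes are not copied, and the architecture name is passed in by hand rather than read from the IS10744:arch processing instruction.

```python
import xml.sax
from xml.sax.saxutils import escape


class SimpleArchFilter(xml.sax.ContentHandler):
    """Hypothetical, simplified stand-in for an architecture processor."""

    def __init__(self, arch_name):
        super().__init__()
        self.arch_name = arch_name
        self.out = []     # pieces of the architectural document
        self.stack = []   # architectural name per open element, or None

    def startElement(self, name, attrs):
        if not self.stack:
            # Assumption: the document element maps to the architecture name
            # unless it carries an explicit architecture attribute.
            arch_elem = attrs.get(self.arch_name, self.arch_name)
        else:
            # Other elements map only if they carry the attribute.
            arch_elem = attrs.get(self.arch_name)
        self.stack.append(arch_elem)
        if arch_elem is not None:
            self.out.append("<%s>" % arch_elem)

    def endElement(self, name):
        arch_elem = self.stack.pop()
        if arch_elem is not None:
            self.out.append("</%s>" % arch_elem)

    def characters(self, content):
        # Keep text only when the innermost open element is architectural.
        if self.stack and self.stack[-1] is not None:
            self.out.append(escape(content))


SAMPLE = b"""<?xml version="1.0"?>
<?IS10744:arch name="html"?>
<doc>
<title html="h1">My first architectual document</title>
<author html="address">Geir Ove Gronmo, grove@infotek.no</author>
<para>This is the first paragraph in this document</para>
<para html="p">This is the second paragraph</para>
</doc>
"""


def process(document, arch_name):
    """Return the architectural document for arch_name as a string."""
    handler = SimpleArchFilter(arch_name)
    xml.sax.parseString(document, handler)
    return "".join(handler.out)


result = process(SAMPLE, "html")
print(result)
```

Run against the sample document, the sketch prints the same text as the result shown above, including the empty line left behind by the dropped paragraph.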

See also the files "simple.py" and "simple.xml" in the "demo/arch" directory of the Python/XML distribution. If you process the persons architecture in this document instead, you get the following output:

<persons>

<author>Geir Ove Grønmo</author><mentioned>Eliot Kimber</mentioned><mentioned>David Megginson</mentioned><mentioned>Lars Marius Garshol</mentioned>
</persons>

A more complex example

Python code:

# Import needed modules
from xml.sax import saxexts, saxlib, saxutils
import sys, xmlarch

# Create architecture processor handler
arch_handler = xmlarch.ArchDocHandler()

# Create parser and register architecture processor with it
parser = saxexts.XMLParserFactory.make_parser()
parser.setDocumentHandler(arch_handler)

# Add document handlers to process the html and biblio architectures
arch_handler.addArchDocumentHandler("html", xmlarch.Normalizer(open("html.out", "w")))
arch_handler.addArchDocumentHandler("biblio", saxutils.ESISDocHandler(open("biblio1.out", "w")))
arch_handler.addArchDocumentHandler("biblio", saxutils.Canonizer(open("biblio2.out", "w")))

# Register a default document handler that just passes through any incoming events
arch_handler.setDefaultDocumentHandler(xmlarch.Normalizer(sys.stdout))

# Parse (and process) the document
parser.parse("complex.xml")

Because this produces a lot of output, I have not included the XML document or the results here. See instead the files "complex.py" and "complex.xml" in the "demo/xml" directory of the Python/XML distribution and try it yourself.