The Simple API for XML (SAX) isn't a standard in the formal sense, but an informal specification designed by David Megginson, with input from many people on the xml-dev mailing list. SAX defines an event-driven interface for parsing XML. To use SAX, you must create Python class instances that implement a specified interface, and the parser will then call various methods of those objects.
SAX is most suitable for purposes where you want to read through an entire XML document from beginning to end, and perform some computation, such as building a data structure representing a document, or summarizing information in a document (computing an average value of a certain element, for example). It's not very useful if you want to modify the document structure in some complicated way that involves changing how elements are nested, though it could be used if you simply wish to change element contents or attributes. For example, you would not want to re-order chapters in a book using SAX, but you might want to change the contents of any name elements with the attribute lang equal to 'greek' into Greek letters.
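As a rough illustration of the "average value of a certain element" case, here is a minimal sketch. It assumes the standard library's xml.sax module, in which ContentHandler plays the role that DocumentHandler plays in the older interface described in this section. The element names `catalog`, `item` and `price`, and the class name `AverageHandler`, are invented for the example.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class AverageHandler(ContentHandler):
    """Collect the numeric text of every element with a given name."""

    def __init__(self, element_name):
        super().__init__()
        self.element_name = element_name
        self.values = []
        self._buffer = None  # None means "not inside a target element"

    def startElement(self, name, attrs):
        if name == self.element_name:
            self._buffer = []

    def characters(self, content):
        # Text may be delivered in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.element_name and self._buffer is not None:
            self.values.append(float("".join(self._buffer)))
            self._buffer = None

    def average(self):
        return sum(self.values) / len(self.values) if self.values else None


# Hypothetical sample document; "price" is an assumed element name.
SAMPLE = b"""<catalog>
  <item><price>10.0</price></item>
  <item><price>20.0</price></item>
  <item><price>30.0</price></item>
</catalog>"""

handler = AverageHandler("price")
xml.sax.parseString(SAMPLE, handler)
print(handler.average())
```

The handler keeps only the list of numbers, not the document itself, which is the kind of single-pass summary SAX is suited to.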
One advantage of SAX is its speed and simplicity. Let's say you've defined a complicated DTD for listing comic books, and you wish to scan through your collection and list everything written by Neil Gaiman. For this specialized task, there's no need to expend effort examining elements for artists and editors and colourists, because they're irrelevant to the search. You can therefore write a handler class that ignores every element other than writer.
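The comic-book search might be sketched as follows, again assuming the standard library's xml.sax module and its ContentHandler class. Only the `writer` element comes from the text above; the `collection` and `comic` elements, the `title` attribute, and the sample data are hypothetical stand-ins for whatever the real DTD defines.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class WriterFinder(ContentHandler):
    """Record the title of each comic whose <writer> matches a given name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []
        self._title = None
        self._in_writer = False
        self._text = []

    def startElement(self, name, attrs):
        if name == "comic":            # assumed container element
            self._title = attrs.get("title")
        elif name == "writer":
            self._in_writer = True
            self._text = []
        # Every other element (artist, editor, ...) is simply ignored.

    def characters(self, content):
        if self._in_writer:
            self._text.append(content)

    def endElement(self, name):
        if name == "writer":
            if "".join(self._text).strip() == self.wanted:
                self.matches.append(self._title)
            self._in_writer = False


# Hypothetical collection used only to exercise the handler.
COLLECTION = b"""<collection>
  <comic title="Sandman">
    <writer>Neil Gaiman</writer>
    <artist>Sam Kieth</artist>
  </comic>
  <comic title="Watchmen">
    <writer>Alan Moore</writer>
    <artist>Dave Gibbons</artist>
  </comic>
</collection>"""

finder = WriterFinder("Neil Gaiman")
xml.sax.parseString(COLLECTION, finder)
print(finder.matches)
```

The handler contains no code for artists or other credits; the parser still reports those elements, but the methods fall through without doing any work.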
Another advantage is that you don't have the whole document resident in memory at any one time, which matters if you are processing really huge documents.
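One way to see the memory behaviour is to feed the parser a document in small pieces. The sketch below assumes the standard library's xml.sax module, whose default parser supports incremental `feed()` and `close()` calls. The `ElementCounter` class, the `log` and `entry` element names, and the 256-byte chunk size are illustrative choices, and the in-memory `BytesIO` object merely stands in for a large file.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Count start tags without storing any part of the document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


# Stand-in for a very large file: one <log> holding 1000 <entry/> elements.
source = io.BytesIO(b"<log>" + b"<entry/>" * 1000 + b"</log>")

parser = xml.sax.make_parser()
counter = ElementCounter()
parser.setContentHandler(counter)

# Hand the parser one small chunk at a time; only the current chunk
# and the handler's own state need to be held by this code.
while True:
    chunk = source.read(256)
    if not chunk:
        break
    parser.feed(chunk)
parser.close()

print(counter.count)
```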
SAX defines four basic interfaces; a SAX-compliant XML parser can be passed any objects that support these interfaces, and will call various methods as data is processed. Your task, therefore, is to implement those interfaces that are relevant to your application.
The SAX interfaces are:
Interface | Purpose |
---|---|
DocumentHandler | Called for general document events. This interface is the heart of SAX; its methods are called for the start of the document, the start and end of elements, and for the characters of data contained inside elements. |
DTDHandler | Called to handle DTD events required for basic parsing. This means notation declarations (XML spec section XXX) and unparsed entity declarations (XML spec section XXX). |
EntityResolver | Called to resolve references to external entities. If your documents will have no external entity references, you won't need to implement this interface. |
ErrorHandler | Called for error handling. The parser will call methods from this interface to report all warnings and errors. |
Python doesn't support the concept of interfaces, so the interfaces listed above are implemented as Python classes. The default method implementations are defined to do nothing (the method body is just a Python pass statement), so usually you can simply ignore methods that aren't relevant to your application. The one big exception is the ErrorHandler interface; if you don't provide methods that print a message or otherwise take some action, errors in the XML data will be silently ignored. This is almost certainly not what you want your application to do, so always implement at least the error() and fatalError() methods. xml.sax.saxutils provides an ErrorPrinter class, which sends error messages to standard error, and an ErrorRaiser class, which raises an exception for any warnings or errors.
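A hand-written error handler could look roughly like the sketch below. It assumes the standard library's xml.sax module rather than the saxlib classes described above; there the base class is xml.sax.handler.ErrorHandler, and it offers the same warning(), error() and fatalError() methods. The `CollectingErrorHandler` name and the deliberately broken sample document are made up for the example.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class CollectingErrorHandler(ErrorHandler):
    """Record every problem the parser reports instead of ignoring it."""

    def __init__(self):
        self.problems = []

    def warning(self, exception):
        self.problems.append(("warning", str(exception)))

    def error(self, exception):
        self.problems.append(("error", str(exception)))

    def fatalError(self, exception):
        self.problems.append(("fatal", str(exception)))
        # Parsing cannot sensibly continue after a fatal error.
        raise exception


# Deliberately malformed: <unclosed> is never closed.
BROKEN = b"<doc><unclosed></doc>"

errors = CollectingErrorHandler()
try:
    xml.sax.parseString(BROKEN, ContentHandler(), errors)
    parsed_ok = True
except xml.sax.SAXParseException:
    parsed_ok = False

print(parsed_ok, errors.problems)
```

Recording the problem and then re-raising keeps a log for later reporting while still stopping the application from treating a broken document as valid.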
Pseudo-code for using SAX looks something like this:
    import sys

    # Define your specialized handler classes
    from xml.sax import saxlib

    class docHandler(saxlib.DocumentHandler):
        ...

    # Create an instance of the handler class
    dh = docHandler()

    # Create an XML parser
    parser = ...

    # Tell the parser to use your handler instance
    parser.setDocumentHandler(dh)

    # Parse the file; your handler's methods will get called
    parser.parseFile(sys.stdin)
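For comparison, here is a runnable sketch of the same sequence of steps. It is not the saxlib interface shown in the pseudo-code: it assumes the standard library's xml.sax module, where the handler base class is ContentHandler, the parser comes from `xml.sax.make_parser()`, and the corresponding calls are `setContentHandler()` and `parse()`. The `DocHandler` class and the tiny `book` document are illustrative only.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


# Define your specialized handler class
class DocHandler(ContentHandler):
    """Record each event so the order of the callbacks is visible."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def endDocument(self):
        self.events.append("endDocument")


# Create an instance of the handler class
dh = DocHandler()

# Create an XML parser
parser = xml.sax.make_parser()

# Tell the parser to use your handler instance
parser.setContentHandler(dh)

# Parse a file-like object; the handler's methods get called as it is read.
# An in-memory stream stands in for sys.stdin here.
parser.parse(io.BytesIO(b"<book><chapter/></book>"))

for event in dh.events:
    print(event)
```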