3.1 Starting Out

Following the earlier example, let's consider a simple XML format for storing information about a comic book collection. Here's a sample document for a collection consisting of a single issue:

<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>

An XML document must have a single root element; here it is the "collection" element. It has one child comic element for each issue. The issue's title and number are given as attributes of the comic element, and its children name the issue's writers and artists. A single issue may have several of each.

Let's start off with something simple: a document handler named FindIssue that reports whether a given issue is in the collection.

from xml.sax import saxlib

class FindIssue(saxlib.HandlerBase):
    def __init__(self, title, number):
        self.search_title, self.search_number = title, number

The HandlerBase class inherits from all four interfaces: DocumentHandler, DTDHandler, EntityResolver, and ErrorHandler. Use it when you want a single class to handle everything. When you want a separate class for each purpose, subclass each interface individually. Neither approach is always "better" than the other; their suitability depends on what you're trying to do, and on what you prefer.
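To illustrate the "separate classes" approach, here is a minimal sketch. It assumes the xml.sax package in the current Python standard library rather than saxlib, so the base classes are ContentHandler and ErrorHandler instead of the interfaces named above. The class and function names (ComicCounter, ErrorCollector, parse_with_separate_handlers) are hypothetical and chosen only for this example.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class ComicCounter(ContentHandler):
    """Handles document events only: counts comic elements."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == 'comic':
            self.count += 1


class ErrorCollector(ErrorHandler):
    """Handles error events only: records fatal errors, then re-raises."""

    def __init__(self):
        self.errors = []

    def fatalError(self, exception):
        self.errors.append(exception)
        raise exception


def parse_with_separate_handlers(xml_bytes):
    # Each handler is registered with the parser independently.
    counter, collector = ComicCounter(), ErrorCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(counter)
    parser.setErrorHandler(collector)
    try:
        parser.parse(io.BytesIO(xml_bytes))
    except xml.sax.SAXParseException:
        pass  # already recorded by the error handler
    return counter.count, collector.errors
```

Keeping the two roles apart means the error-handling class can be reused with any other document handler, at the cost of having two objects to create and register.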

Since this class is doing a search, an instance needs to know what to search for. The desired title and issue number are passed to the FindIssue constructor, and stored as part of the instance.

Now let's look at the method that actually does all the work. This simple task only requires looking at the attributes of a given element, so only the startElement method is relevant.

    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != 'comic': return

        # Look for the title and number attributes (see text)
        title = attrs.get('title', None)
        number = attrs.get('number', None)
        if title == self.search_title and number == self.search_number:
            print title, '#'+str(number), 'found'

The startElement() method is passed a string giving the name of the element, and an instance containing the element's attributes. The latter implements the AttributeList interface, which includes most of the semantics of Python dictionaries, so get() can be used to look up an attribute by name. The method ignores everything except comic elements, and compares their title and number attributes to the search values. Attribute values are always strings, so the issue number to search for must also be given as a string, such as '62'. If both values match, a message is printed out.
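The saxlib module used above belongs to the original Python XML package and is not part of current Python releases. As a rough modern equivalent, the following sketch assumes the standard library's xml.sax package, where the handler base class is ContentHandler and the attributes object also supports get(). The names FindIssueSketch and issue_in_collection are hypothetical, and the handler records the result in an attribute instead of printing it.

```python
import xml.sax
from xml.sax.handler import ContentHandler

SAMPLE = b"""<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""


class FindIssueSketch(ContentHandler):
    def __init__(self, title, number):
        super().__init__()
        self.search_title, self.search_number = title, number
        self.found = False

    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != 'comic':
            return
        # Attribute values are strings; get() returns None if absent
        title = attrs.get('title', None)
        number = attrs.get('number', None)
        if title == self.search_title and number == self.search_number:
            self.found = True


def issue_in_collection(xml_bytes, title, number):
    handler = FindIssueSketch(title, number)
    xml.sax.parseString(xml_bytes, handler)
    return handler.found
```

Calling issue_in_collection(SAMPLE, 'Sandman', '62') reports a match, while passing the integer 62 does not, because the parser delivers the number attribute as the string '62'.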

startElement() is called for every single element in the document. If you added the statement print 'Starting element:', name to the top of startElement(), you would get the following output:

Starting element: collection
Starting element: comic
Starting element: writer
Starting element: penciller
Starting element: penciller
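The same trace can be reproduced as a list of names. This is a small sketch that assumes the standard library's xml.sax package in current Python rather than saxlib; ElementTrace and element_names are hypothetical names used only here.

```python
import xml.sax
from xml.sax.handler import ContentHandler

SAMPLE = b"""<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""


class ElementTrace(ContentHandler):
    """Records the name of every element in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def element_names(xml_bytes):
    handler = ElementTrace()
    xml.sax.parseString(xml_bytes, handler)
    return handler.names
```

For the sample document, element_names(SAMPLE) yields the five names in the order shown above, with penciller appearing twice because startElement() fires once per element, not once per distinct name.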

To actually use the class, we need top-level code that creates instances of a parser and of FindIssue, associates them, and then calls a parser method to process the input.

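import sys
from xml.sax import saxexts

if __name__ == '__main__':
    # Create a parser
    parser = saxexts.make_parser()

    # Create the handler; the issue number is a string,
    # because attribute values are strings
    dh = FindIssue('Sandman', '62')

    # Tell the parser to use our handler
    parser.setDocumentHandler(dh)

    # Parse the input; parseFile() needs an open file object.
    # Taking the filename from the command line is an assumption.
    parser.parseFile(open(sys.argv[1]))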

The ParserFactory class can automate the job of creating parsers. Several XML parsers are already available to Python, and more may be added in the future. "xmllib.py" is included with Python 1.5, so it's always available, but it is not particularly fast. A faster version of "xmllib.py" is included in xml.parsers. The pyexpat module is faster still, so it is the preferred choice whenever it's available. ParserFactory's make_parser method determines which parsers are available and chooses the fastest one, so you don't need to know what the different parsers are or how they differ. (You can also tell make_parser to use a specific parser, if you want one.)
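Both ways of obtaining a parser can be sketched as follows. This assumes the make_parser function in the standard library's xml.sax package of current Python, not saxexts; it takes an optional list of module names to try first. The function name make_parsers is hypothetical, and 'xml.sax.expatreader' is the expat-based driver shipped with the standard library.

```python
import xml.sax
from xml.sax.xmlreader import XMLReader


def make_parsers():
    # Let the library pick whichever parser driver is available.
    default_parser = xml.sax.make_parser()

    # Ask for a specific driver by module name; the library falls
    # back to its default list if the named module can't be used.
    explicit_parser = xml.sax.make_parser(['xml.sax.expatreader'])

    return default_parser, explicit_parser
```

Either object can then be given a handler and asked to parse a document, so code that uses the parser does not depend on which driver was selected.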

Once you've created a parser instance, calling setDocumentHandler tells the parser what to use as the handler.

If you run the above code with the sample XML document, it'll output Sandman #62 found.