3.3 Searching Element Content

Let's tackle a slightly more complicated task: printing out all issues written by a certain author. This requires looking at element content, because the writer's name is inside a writer element: <writer>Peter Milligan</writer>.

The search will be performed using the following algorithm:

1. The startElement method becomes more complicated. For comic elements, the handler has to save the title and number, in case this comic is later found to match the search criterion. For writer elements, it sets an inWriterContent flag to true and sets a writerName attribute to the empty string.
2. Characters outside of XML tags must be processed. While inWriterContent is true, these characters are appended to the writerName string.
3. When the writer element ends, all of the element's content has been collected in the writerName attribute, so we can check whether the name matches the one we're searching for and, if so, print the information about this comic. We must also set inWriterContent back to false.

Here's the first part of the code; this implements step 1.

from xml.sax import saxlib
import string

def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return string.join( string.split(text), ' ')

class FindWriter(saxlib.HandlerBase):
    def __init__(self, search_name):
        # Save the name we're looking for
        self.search_name = normalize_whitespace( search_name )

        # Initialize the flag to false
        self.inWriterContent = 0

    def startElement(self, name, attrs):
        # If it's a comic element, save the title and issue
        if name == 'comic':
            title = normalize_whitespace( attrs.get('title', "") )
            number = normalize_whitespace( attrs.get('number', "") )
            self.this_title = title
            self.this_number = number

        # If it's the start of a writer element, set flag
        elif name == 'writer':
            self.inWriterContent = 1
            self.writerName = ""

The startElement() method has been discussed previously. Now we have to look at how the content of elements is processed.

The normalize_whitespace() function is important, and you'll probably use it in your own code. XML treats whitespace very flexibly; you can include extra spaces or newlines wherever you like. This means that you must normalize the whitespace before comparing attribute values or element content; otherwise the comparison might produce a wrong result because the content of two elements contains different amounts of whitespace.
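As a small illustration of why normalization matters, the sketch below compares two differently spaced versions of the same name. It uses the string-method spelling of the same idea rather than the string module used above; that choice, and the sample strings, are assumptions made for the example.

```python
def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    # split() with no arguments splits on any run of whitespace and
    # discards leading and trailing whitespace.
    return ' '.join(text.split())

raw_a = "Peter   Milligan"
raw_b = "\n  Peter\nMilligan  "

# The raw strings differ, but their normalized forms are equal.
print(raw_a == raw_b)
print(normalize_whitespace(raw_a) == normalize_whitespace(raw_b))
```

The first comparison fails and the second succeeds, which is the behaviour the handler relies on when it compares names.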

    def characters(self, ch, start, length):
        if self.inWriterContent:
            self.writerName = self.writerName + ch[start:start+length]

The characters() method is called for characters that aren't inside XML tags. ch is a string of characters, and start is the point in the string where the characters start. length is the length of the character data. You should not assume that start is equal to 0, or that all of ch is the character data. An XML parser could be implemented to read the entire document into memory as a string, and then operate by indexing into the string. This would mean that ch would always contain the entire document, and only the values of start and length would be changed.
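To make the role of start and length concrete, here is a toy sketch in which a caller passes one large string plus offsets, much as the whole-document parser described above would. The extract() helper and the sample document string are hypothetical, invented only for this illustration; no real parser is involved.

```python
def extract(ch, start, length):
    # Only this slice is the character data; the rest of ch is
    # whatever else the parser happens to keep in its buffer.
    return ch[start:start + length]

# A hypothetical parser buffer that holds more than the character data.
document = "<writer>Peter Milligan</writer>"
start = document.index("Peter")
length = len("Peter Milligan")

print(extract(document, start, length))
```

Using all of ch here would pick up the surrounding tags as well, which is why the handler always slices with start and length.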

You also shouldn't assume that all the characters are passed in a single function call. In the example above, there might be only one call to characters() for the string "Peter Milligan", or the parser might call characters() once for each character. More realistically, if the content contains an entity reference, as in "Wagner &amp; Seagle", the parser might call the method three times: once for "Wagner ", once for "&", represented by the entity reference, and again for " Seagle".
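One way to observe this is to record every chunk a parser delivers. The sketch below assumes the xml.sax module from the current Python standard library, whose ContentHandler.characters() receives a single string argument rather than the (ch, start, length) triple described in this section; the ChunkRecorder class is a hypothetical name used only here.

```python
import xml.sax

class ChunkRecorder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # Record each piece exactly as the parser delivers it.
        self.chunks.append(content)

recorder = ChunkRecorder()
xml.sax.parseString(b"<writer>Wagner &amp; Seagle</writer>", recorder)

# How many chunks arrive depends on the parser; only their
# concatenation is dependable.
print(recorder.chunks)
print("".join(recorder.chunks))
```

The number of chunks is up to the parser, so a handler should accumulate the pieces and look at the result only once the element has ended.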

For step 2 of FindWriter, characters() only has to check inWriterContent, and if it's true, add the characters to the string being built up.

Finally, when the writer element ends, the entire name has been collected, so we can compare it to the name we're searching for.

    def endElement(self, name):
        if name == 'writer':
            self.inWriterContent = 0
            self.writerName = normalize_whitespace(self.writerName)
            if self.search_name == self.writerName:
                print 'Found:', self.this_title, self.this_number

This is an unrealistically simple comparison: whitespace has been normalized, but it is still an exact string match that will be fooled by differences in capitalization or punctuation. It's good enough for an example.
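For readers on a current Python, the three steps can be put together roughly as follows. This is a sketch under stated assumptions, not the code from this section: it uses xml.sax.ContentHandler from the standard library in place of saxlib.HandlerBase, its characters() method takes a single string, the found list is an addition for inspecting results, and the sample document (including the collection root element) is invented for the example.

```python
import xml.sax

def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return ' '.join(text.split())

class FindWriter(xml.sax.ContentHandler):
    def __init__(self, search_name):
        super().__init__()
        self.search_name = normalize_whitespace(search_name)
        self.inWriterContent = False
        self.writerName = ""
        self.this_title = ""
        self.this_number = ""
        self.found = []  # (title, number) pairs that matched

    def startElement(self, name, attrs):
        # Step 1: remember the comic, or start collecting a writer name
        if name == 'comic':
            self.this_title = normalize_whitespace(attrs.get('title', ""))
            self.this_number = normalize_whitespace(attrs.get('number', ""))
        elif name == 'writer':
            self.inWriterContent = True
            self.writerName = ""

    def characters(self, content):
        # Step 2: accumulate text only while inside a writer element
        if self.inWriterContent:
            self.writerName += content

    def endElement(self, name):
        # Step 3: compare the collected name and reset the flag
        if name == 'writer':
            self.inWriterContent = False
            self.writerName = normalize_whitespace(self.writerName)
            if self.search_name == self.writerName:
                self.found.append((self.this_title, self.this_number))
                print('Found:', self.this_title, self.this_number)

# Hypothetical sample data, invented for this sketch.
SAMPLE = b"""<collection>
  <comic title="Sandman" number="62">
    <writer>Neil Gaiman</writer>
  </comic>
  <comic title="Shade, the Changing Man" number="7">
    <writer>Peter
        Milligan</writer>
  </comic>
</collection>"""

handler = FindWriter("Peter Milligan")
xml.sax.parseString(SAMPLE, handler)
```

The second writer element deliberately contains a line break inside the name, so the match succeeds only because both sides are normalized before the comparison.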

End tags can't have attributes on them, so there's no attrs parameter. Empty elements with attributes, such as <arc name="Season of Mists"/>, will result in a call to startElement(), followed immediately by a call to endElement().
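The event order for an empty element can be checked with a handler that simply logs what it receives. As before, this is a sketch that assumes the current standard-library xml.sax module; EventLogger is a hypothetical class written for this illustration.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Copy the attributes into a plain dict for easy inspection.
        self.events.append(('start', name, dict(attrs.items())))

    def endElement(self, name):
        # No attrs parameter here: end tags carry no attributes.
        self.events.append(('end', name))

logger = EventLogger()
xml.sax.parseString(b'<arc name="Season of Mists"/>', logger)
print(logger.events)
```

The log contains a start event carrying the name attribute, followed directly by the matching end event.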

XXX how are external entities handled? Anything special need to be done for them?