1. Introduction to XML

XML, the eXtensible Markup Language, is a simplified dialect of SGML, the Standard Generalized Markup Language. XML is intended to be reasonably simple to implement and use, and is already being used for specifying markup languages for various new standards: MathML for expressing mathematical equations, SMIL (Synchronized Multimedia Integration Language) for synchronizing multimedia objects, and so forth.

SGML and XML represent a document by tagging the document's various components with their function, or meaning. For example, an academic paper contains several parts: it has a title, one or more authors, an abstract, the actual text of the paper, a list of references, and so forth. A markup language for writing such papers would therefore have tags for indicating what the contents of the abstract are, what the title is, and so forth. This should not be confused with the physical details of how the document is actually printed on paper. The abstract might be printed with narrow margins in a smaller font than the rest of the document, but the markup usually won't be concerned with details such as this; other software will translate from the markup language to a typesetting language such as TeX, and will handle the details.

A markup language specified using XML looks a lot like HTML; a document consists of a single element, which contains sub-elements, which can have further sub-elements inside them. Elements are indicated by tags in the text. Tags are always inside angle brackets < >. There are two forms of elements. An element can contain content between opening and closing tags, as in <name>Euryale</name>, which is a name element containing the data "Euryale". This content may be text data, other XML elements, or a mixture of both. Elements can also be empty, in which case they contain nothing, and are represented as a single tag ended with a slash, as in <stop/>, which is an empty stop element. Unlike HTML, XML element names are case-sensitive; stop and Stop are two different element types.
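
As a minimal sketch of the two element forms described above, the following uses the xml.etree.ElementTree module from the Python standard library. The choice of module and the wrapping entry element are assumptions made for illustration only; they are not taken from this HOWTO.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one top-level element ("entry") wrapping the
# two example elements from the text.
doc = "<entry><name>Euryale</name><stop/></entry>"

root = ET.fromstring(doc)
name = root.find("name")
stop = root.find("stop")

print(root.tag)                      # the single top-level element
print(name.tag, name.text)           # element with text content
print(stop.text is None, len(stop))  # empty element: no text, no children

# Element names are case-sensitive, so "Stop" does not match "stop".
print(root.find("Stop") is None)
```

Here find() only looks at direct children of the element it is called on, which is enough for a document this small.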

Opening and empty tags can also contain attributes, which specify values associated with an element. For example, in text such as <name lang='greek'>Herakles</name>, the name element has a lang attribute which has a value of "greek". This would contrast with <name lang='latin'>Hercules</name>, where the attribute's value is "latin".
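
To make the attribute example concrete, here is a short sketch that reads the lang attribute from both name elements. It again assumes the standard library's xml.etree.ElementTree module, and the enclosing names element is hypothetical, added only because a document needs a single top-level element.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element "names" around the two examples from the text.
doc = (
    "<names>"
    "<name lang='greek'>Herakles</name>"
    "<name lang='latin'>Hercules</name>"
    "</names>"
)

root = ET.fromstring(doc)

# Map each element's lang attribute value to its text content.
by_lang = {el.get("lang"): el.text for el in root.iter("name")}
print(by_lang)
```

Calling get() on an element returns None when the attribute is absent, so code that handles optional attributes should check for that case.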

A given XML language is specified with a Document Type Definition, or DTD. The DTD declares the element names that are allowed, and how elements can be nested inside each other. The DTD also specifies the attributes that can be provided for each element, their default values, and whether they can be omitted. To take an example from HTML, the LI element, representing an entry in a list, can only occur inside certain elements which represent lists, such as OL or UL. A validating parser can be given a DTD and a document, and either verify that the document is legal according to the DTD's rules, or determine that one or more rules have been violated.
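
The kind of rule a DTD expresses can be illustrated with a toy check of the LI nesting rule mentioned above. This is not DTD validation and does not read a DTD; the ALLOWED_PARENTS table and the nesting_errors function are hypothetical names invented for this sketch, which only shows the flavor of one check a validating parser performs.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: child element name -> names of allowed parents.
# A real DTD expresses this (and much more) in its own declaration syntax.
ALLOWED_PARENTS = {"LI": {"OL", "UL"}}

def nesting_errors(root):
    """Return (child, parent) tag pairs that break ALLOWED_PARENTS."""
    errors = []
    for parent in root.iter():
        for child in parent:
            allowed = ALLOWED_PARENTS.get(child.tag)
            if allowed is not None and parent.tag not in allowed:
                errors.append((child.tag, parent.tag))
    return errors

good = ET.fromstring("<BODY><UL><LI>one</LI></UL></BODY>")
bad = ET.fromstring("<BODY><LI>stray</LI></BODY>")

print(nesting_errors(good))
print(nesting_errors(bad))
```

The first document produces no errors, while the second reports the LI element found directly inside BODY.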

Applications that process XML can be classed into two types. The simplest class is an application that only handles one particular markup language. For example, a chemistry program may only need to process Chemical Markup Language, but not MathML. This application can therefore be written specifically for a single DTD, and doesn't need to be capable of handling multiple markup languages. This type is simpler to write, and can easily be implemented with the available Python software.

The second type of application is less common, and has to be able to handle any markup language you throw at it. An example might be a smart XML editor that helps you to write XML that conforms to a selected DTD; it might do so by not letting you enter an element where it would be illegal, or by suggesting elements that can be placed at the current cursor location. Such an application needs to handle any possible XML-defined markup, and therefore must be able to obtain a data structure embodying the DTD in use. At the time of writing, this type of application can't be implemented in Python without difficulty.

For the full details of XML's syntax, the one definitive source is the XML 1.0 specification, available on the Web at http://www.w3.org/TR/xml-spec.html. However, like all specifications, it's quite formal and isn't intended to be a friendly introduction or a tutorial. The annotated version of the standard, at http://www.xml.com/XXX, is quite helpful in clarifying the specification's intent. There are also various informal tutorials and books available to introduce you to XML.

The rest of this HOWTO will assume that you're familiar with the relevant terminology. Most sections will use XML terms such as element and attribute; section 4 on the Document Object Model will assume that you've read the relevant Working Draft, and are familiar with things like Iterators and Nodes. Section 3 does not require that you have experience with the Java SAX implementations.