Introduction
The software used to create this blog writes a number of files: the individual HTML pages, the HTML index files and the Atom feed. Though it's not necessary for the HTML to be well-formed XML I do have a strong preference for it to be. Actually, I'd rather just use XHTML but that's a separate rant.
XML is just text and can be treated as such but there are enough
traps for the wary, let alone the unwary, that it's really not a
good plan; in particular, when doing template substitution on the text
form it is quite hard to guarantee that the result will be well
formed. It's much better to handle it as a proper document
structure and leave serialization to code which specializes in
doing that right.
Python has a number of packages for handling XML, with varying degrees of power and standardization, but I decided to just use the basic xml.dom and xml.dom.minidom for simplicity.
The lack of any form of XPath expressions or the like could be
a pain in some applications but isn't a problem here.
The compromised namespace management is more irritating but I've
hived off dealing with that to a separate module avoiding too
much grief while accepting some limitations.
The general scheme is to have template documents stored in the Python code as multiline strings and "compiled" to DOM trees on load. For each output document the Python code recurses down the appropriate template tree producing the output by copying most of the template and performing substitutions for elements in a specific namespace.
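As a rough illustration of that scheme, here is a minimal Python 3 sketch using xml.dom.minidom. It is not the blog's actual code: the namespace URI, the names `TEMPLATE_SOURCE`, `expand` and `generate`, and the restriction to a single `t:sequence` element looked up in a plain dict are all assumptions made for the example.

```python
from xml.dom import minidom

# Hypothetical namespace URI; the one the blog software really uses is not given.
TEMPLATE_NS = "http://example.org/ns/template"

TEMPLATE_SOURCE = (
    '<p xmlns:t="http://example.org/ns/template" class="greeting">'
    'Hello, <t:sequence value="name"/>!</p>'
)

# "Compile" the template string to a DOM tree once, at load time.
TEMPLATE_DOM = minidom.parseString(TEMPLATE_SOURCE)


def expand(node, result_doc, values):
    """Return a list of result nodes produced from one template node."""
    if node.nodeType == node.TEXT_NODE:
        return [result_doc.createTextNode(node.data)]
    if node.nodeType != node.ELEMENT_NODE:
        return []  # comments and the like are ignored in this sketch
    if node.namespaceURI == TEMPLATE_NS:
        # Only t:sequence with a plain dictionary key is handled here.
        key = node.getAttribute("value")
        return [result_doc.createTextNode(str(values[key]))]
    # Ordinary element: copy it, its attributes and (recursively) its children.
    copy = result_doc.createElement(node.tagName)
    for name, value in node.attributes.items():
        if not name.startswith("xmlns"):  # drop namespace declarations
            copy.setAttribute(name, value)
    for child in node.childNodes:
        for new_child in expand(child, result_doc, values):
            copy.appendChild(new_child)
    return [copy]


def generate(values):
    """Expand the compiled template against a dict of values."""
    result_doc = minidom.Document()
    for new_node in expand(TEMPLATE_DOM.documentElement, result_doc, values):
        result_doc.appendChild(new_node)
    return result_doc.documentElement.toxml()
```

Because serialization is left to minidom, a value such as "Tom & Jerry" comes out escaped rather than breaking well-formedness, which is the point of working on the tree instead of the text.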
The template document therefore has a similar flavour to an XSLT Literal Result Element StyleSheet. The main difference is that the values substituted come from Python data structures rather than from an input XML document.
Template Elements
Here's a template fragment which illustrates four of the special template elements. It's the part which generates the blog link subtitle for blog entry pages (An Eccentric Anomaly: Ed Davies's Blog, see the title bar, above). I've changed the whitespace for readability; the original is set so that the whitespace carried through makes the result document reasonably readable, at the cost of the template being a bit of a puzzle:
<t:for_each value="blog?">
  <h2>
    <a>
      <t:attribute name="href">/<t:path generator=".path"/>/</t:attribute>
      <span class="blogTitle">
        <t:sequence value=".title"/>
      </span>
      <span class="blogTitleSeparator">: </span>
      <span class="blogSubtitle">
        <t:sequence value=".subtitle"/>
      </span>
    </a>
  </h2>
</t:for_each>
With the prefix t bound to the appropriate namespace, the special elements processed are:
<t:sequence value="..." generator="...">
Substitutes the value(s) specified by the value and generator attributes as described below. If multiple values are specified they are simply concatenated with no intervening whitespace.
<t:for_each value="..." generator="...">
Expands the contents of the for_each element once for each value specified by the value and generator attributes as described below. During each expansion the value is pushed on to the top of the context stack (also described below) and popped back off afterwards.
<t:attribute name="...">
Sets an attribute named by the name attribute on the closest enclosing result element. The attribute value is found by concatenating the node values (text strings stripped of any XML markup) of the children of the t:attribute element. Typically this will be a single t:sequence element. The attribute is put in the default namespace.
<t:path value="..." generator="...">
Makes a list of the string values of the contents of the t:path element and the values specified by the value and generator attributes and joins them together separated by the URI path separator character ('/'). Substitutes that into the result.
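The t:attribute and t:path behaviour described above can be sketched in Python 3 roughly as follows. The helper names (`text_value`, `join_path`, `apply_attribute`) are hypothetical, and the ordering of path parts (element contents first, then attribute values) is an assumption read off the description rather than taken from the real code.

```python
from xml.dom import minidom


def text_value(node):
    """Concatenate the text under a node, dropping any markup."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    return "".join(text_value(child) for child in node.childNodes)


def join_path(content_strings, attribute_values):
    """Join path parts with the URI path separator, as t:path is described."""
    parts = [str(part) for part in list(content_strings) + list(attribute_values)]
    return "/".join(parts)


def apply_attribute(result_element, name, expanded_children):
    """Set `name` on the enclosing result element, as t:attribute is described."""
    value = "".join(text_value(child) for child in expanded_children)
    result_element.setAttribute(name, value)


# Usage: the href from the fragment above, with a made-up path of two parts.
doc = minidom.Document()
link = doc.createElement("a")
children = [
    doc.createTextNode("/"),
    doc.createTextNode(join_path([], ["blog", "2010"])),
    doc.createTextNode("/"),
]
apply_attribute(link, "href", children)
```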
Expressions
The values of the value and generator attributes of the various template elements are expressions which reference Python values of various sorts.
The basis of this access is a context stack. This is a simple list of Python objects. Any sort of Python object can be used but they're typically normal objects with properties to be accessed by name in the expressions or dicts with string keys to be accessed similarly.
When a template substitution operation is started a root context object
is specified. For example, when a page for an individual blog entry is to
be produced this is a Python object with properties containing or
referencing all the required information
about the blog entry. This forms the single initial value on the context stack.
Whenever a t:for_each element is encountered the values it references are, in turn, pushed onto the context stack, the body of the t:for_each is expanded and the value popped off the stack. t:for_each elements can be nested in the obvious way.
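The push, expand, pop cycle can be sketched in a few lines of Python 3. The function name `expand_for_each` and the callback-style `expand_body` are inventions for the example; the real code recurses over DOM nodes rather than calling a function.

```python
def expand_for_each(stack, values, expand_body):
    """Expand a body once per value, with the value on top of the stack."""
    results = []
    for value in values:
        stack.append(value)          # push the current value
        try:
            results.append(expand_body(stack))
        finally:
            stack.pop()              # always pop, even if the body fails
    return results


# Usage: two nested loops over made-up data.
root = {"entries": [{"name": "a", "tags": ["x", "y"]},
                    {"name": "b", "tags": ["z"]}]}
stack = [root]                        # the root context is the initial value


def entry_body(stack):
    entry = stack[-1]                 # innermost context: the current entry
    # Inside the nested loop, stack[-1] is the tag and stack[-2] the entry.
    return expand_for_each(
        stack, entry["tags"], lambda s: "%s:%s" % (s[-2]["name"], s[-1]))


expanded = expand_for_each(stack, root["entries"], entry_body)
```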
Rather than duplicate some writing, here are the starts of the two main functions responsible for expression evaluation:
def _expr(self, expr):
    """ Evaluate an expression in the context of this generator.

        expr ::= simpleExpr ( '|' simpleExpr )*

        The first simpleExpr which evaluates to something other than
        None is the result.
    """
def _simpleExpr(self, expr):
    """ Evaluate a simple expression string in the context of this
        generator.

        simpleExpr ::= contextSpec ( propRef ( '.' propRef )* )?

        That is, a (possibly empty) context specification followed by
        zero or more dot-separated property references.

        contextSpec ::= '.'*

        An empty context specification indicates use of the context
        object passed to the top-level XMLTemplate.generate function.
        One dot means the current object of the nearest enclosing
        t:for_each, two dots the one surrounding that and so on. When
        there are n nested t:for_each elements n+1 dots is synonymous
        with none (i.e., accesses the top-level context object). More
        than n+1 dots is an error.

        In principle an empty expression could be a reference to the
        top-level context object but, since that's not likely to be
        useful, it's taken as an error. Might revisit this if an
        expression like '.wibble?|' (i.e., the wibble property of the
        innermost context object, if any, otherwise the global context
        object) was ever found to be useful.

        propRef ::= propName '?'?
        propName ::= <any characters other than '.', '?' or '|'>+

        A property reference names a property of a Python object or,
        if it is a dictionary (member of class dict or a derivative),
        then the value for that key. An appended question mark
        indicates that the property is optional: if it is not present
        then None is returned rather than raising an error.
    """
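For readers who would rather see the grammar in action, here is a small Python 3 sketch written only from the two docstrings above, not from the real function bodies (which aren't shown). `ExpressionError`, `eval_simple` and `eval_expr` are hypothetical names, and the context is passed as a root object plus a list of t:for_each values, innermost last.

```python
class ExpressionError(Exception):
    """Raised for malformed expressions or missing required properties."""


def eval_simple(expr, root, for_each_values):
    """Evaluate one simpleExpr: leading dots, then dot-separated propRefs."""
    if not expr:
        raise ExpressionError("empty expression")
    refs = expr.lstrip(".")
    dots = len(expr) - len(refs)
    depth = len(for_each_values)
    if dots == 0 or dots == depth + 1:
        obj = root                        # top-level context object
    elif dots <= depth:
        obj = for_each_values[-dots]      # one dot is the innermost for_each
    else:
        raise ExpressionError("too many dots in %r" % expr)
    if not refs:
        return obj
    for ref in refs.split("."):
        optional = ref.endswith("?")
        name = ref[:-1] if optional else ref
        if not name:
            raise ExpressionError("empty property name in %r" % expr)
        if isinstance(obj, dict):
            found, value = name in obj, obj.get(name)
        else:
            found, value = hasattr(obj, name), getattr(obj, name, None)
        if not found:
            if optional:
                return None               # optional and absent: no error
            raise ExpressionError("no property %r in %r" % (name, expr))
        obj = value
    return obj


def eval_expr(expr, root, for_each_values=()):
    """Evaluate expr: the first alternative that is not None wins."""
    for alternative in expr.split("|"):
        value = eval_simple(alternative, root, for_each_values)
        if value is not None:
            return value
    return None
```

With a root context of `{"blog": {"title": "An Eccentric Anomaly"}}`, both `eval_expr("blog?.title", root)` and `eval_expr(".title", root, [root["blog"]])` give the title string.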
Expression Values
Expression evaluation first accesses the appropriate Python object property or dictionary values described above. If the result is callable it is called and the returned value is used. None values (either original or the result of a call) are discarded.
For the generator attribute the resulting value must be an iterable other than a string. It is iterated and the resulting values are used. The value attribute's value is used directly. If, slightly oddly, an element has both a value and a generator attribute then the value attribute's value is used first, then the generator's values.
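Those rules are short enough to sketch directly in Python 3. The names `resolve` and `element_values` are made up for the example, and raising TypeError for a string generator is an assumption; the text only says the value "must be" a non-string iterable.

```python
def resolve(value):
    """Call the value if it is callable; otherwise return it unchanged."""
    if callable(value):
        value = value()
    return value


def element_values(value_result=None, generator_result=None):
    """Combine the results of the value and generator expressions.

    The value attribute's result comes first, then the generator's
    items. None results (original or returned by a call) are discarded.
    """
    values = []
    single = resolve(value_result)
    if single is not None:
        values.append(single)
    many = resolve(generator_result)
    if many is not None:
        if isinstance(many, str):
            raise TypeError("generator result must not be a string")
        values.extend(many)  # raises TypeError if not iterable
    return values
```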
The value obtained is typically transformed into a DOM tree in the result document. If it's a DOM node it's deep copied in. Iterables other than strings are iterated and the resulting values are copied in, recursively. Strings and anything which can be converted to a string (using the unicode function) are converted to an XML text node.
Here's the start of the docstring for the function responsible for that:
Yields deep copies of possibly virtual DOM tree(s) into the new DOM
Nodes in the implementation of the result document.

source can be, in descending order of precedence:

None
    Yields nothing.

A DOM node:
    Deep copied into the result implementation: see code for odd
    cases.

A tree walker:
    Something implementing domTreeIterator which returns a virtual
    DOM tree or forest to be copied into the result implementation.

A string:
    Instance of basestring (i.e., str or unicode) which is copied
    into a text node in the result implementation.

An iterable:
    Iterated yielding the copies of the elements.

Anything else:
    Anything implementing __unicode__ or __str__ which is copied into
    a text node in the result implementation.
One of the odd cases with DOM nodes alluded to above is CDATA sections. If the result is to be interpreted as HTML then these are converted to text nodes (as they're not part of HTML). Where text is to remain as XML (such as the XHTML in the Atom feed) then CDATA sectionness (sectionality?) is preserved.
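A cut-down Python 3 sketch of that precedence order might look like this. It is an approximation, not the real function: the name `to_nodes` and the `as_html` flag are hypothetical, the tree-walker case is left out, Python 3's str stands in for basestring/unicode, and only a top-level CDATA section is converted (CDATA nested inside a copied element is left alone).

```python
from xml.dom import Node, minidom


def to_nodes(source, result_doc, as_html=False):
    """Yield nodes owned by result_doc that represent `source`."""
    if source is None:
        return                                    # yields nothing
    if isinstance(source, Node):
        if as_html and source.nodeType == Node.CDATA_SECTION_NODE:
            # CDATA sections are not part of HTML: emit plain text instead.
            yield result_doc.createTextNode(source.data)
        else:
            yield result_doc.importNode(source, True)   # deep copy
        return
    if isinstance(source, str):
        yield result_doc.createTextNode(source)
        return
    try:
        items = iter(source)
    except TypeError:
        # Not iterable: fall back to its string form.
        yield result_doc.createTextNode(str(source))
        return
    for item in items:
        yield from to_nodes(item, result_doc, as_html)


# Usage: mixed content copied into a new <div>.
doc = minidom.Document()
div = doc.createElement("div")
em = minidom.parseString("<em>hi</em>").documentElement
for node in to_nodes(["a ", em, 42, None], doc):
    div.appendChild(node)
```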
domTreeIterator is my own function, analogous to Python's __iter__, which returns an iterator over the nodes of a DOM tree. This is particularly handy for "virtual" DOM trees created from parts of existing ones, e.g., the first few paragraphs of a blog entry which are to appear in the blog index HTML document and the Atom feed.
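The real domTreeIterator interface isn't shown here, so the following Python 3 sketch is only a guess at the general shape: a hypothetical `FirstParagraphs` class whose `domTreeIterator` method yields the first few `p` children of an existing element, giving a "virtual" forest without copying or modifying the source tree.

```python
from xml.dom import minidom


class FirstParagraphs:
    """Virtual forest: the first `limit` <p> children of an element."""

    def __init__(self, element, limit):
        self._element = element
        self._limit = limit

    def domTreeIterator(self):
        # Yields existing nodes; the caller is expected to deep copy them
        # into the result document.
        count = 0
        for child in self._element.childNodes:
            if child.nodeType == child.ELEMENT_NODE and child.tagName == "p":
                if count == self._limit:
                    break
                count += 1
                yield child


# Usage: a made-up blog entry with three paragraphs.
entry = minidom.parseString(
    "<div><p>one</p><p>two</p><p>three</p></div>").documentElement
summary = [node.toxml() for node in FirstParagraphs(entry, 2).domTreeIterator()]
```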
Code
I tried showing a few examples but they seemed to need a bit much context to be interesting so, instead, here are a few key modules which should give a pretty good idea.
The main template substitution code is in xmltemplate.py. This declares the XMLTemplate class which is used in a couple of places including genericpagetemplate.py. The templating code is in a separate package whose __init__.py handles the overall templating operation once the higher level code has dealt with making a working copy of the files and before it deals with deployment (rsync or local serving).
If anybody's interested in looking at the rest of the code just ask and I'll see about putting it on BitBucket or something.