[TransWarp] Template parsing and XML/HTML processing design

Sat Jul 19 16:21:32 EDT 2003

* We will only support parsing well-formed XML, including XHTML.  HTML Tidy 
can be used to convert HTML documents, but we will not be including it in 
PEAK at this time, as the available Python bindings for tidylib are awkward 
to build or redistribute.

* Parser: for full document fidelity, we will need to use/require 
Expat.  SAX parsing interfaces do not supply DOCTYPE or comments.  We 
probably need to support pass-through of comments, so that embedded 
JavaScript can be hidden in HTML comments.  We also want to pass through 
the DOCTYPE so that modern browsers know we're using XHTML.
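
As a rough illustration of what Expat gives us here, the sketch below uses 
Python's standard xml.parsers.expat module to capture the DOCTYPE and 
comments alongside element events.  The collect_events() function and the 
event tuple layout are made up for this example; they are not part of any 
existing PEAK interface.

```python
from xml.parsers import expat


def collect_events(xml_text):
    """Parse xml_text and return a list of (kind, data) event tuples."""
    events = []
    parser = expat.ParserCreate()

    def start_doctype(name, system_id, public_id, has_internal_subset):
        # Expat reports the DOCTYPE, which we want to pass through.
        events.append(("doctype", (name, public_id, system_id)))

    def comment(data):
        # Comments are reported too, so they can be passed through as well.
        events.append(("comment", data))

    def start_element(name, attrs):
        events.append(("start", name))

    def end_element(name):
        events.append(("end", name))

    parser.StartDoctypeDeclHandler = start_doctype
    parser.CommentHandler = comment
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.Parse(xml_text, True)
    return events


SAMPLE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html><!-- hidden script --><body/></html>'
)

if __name__ == "__main__":
    for event in collect_events(SAMPLE):
        print(event)
```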

* Tag components: each document element will be mapped to a component, 
beginning with a top-level component for the document as a whole.  Tags 
will support being used as a web behavior (i.e. render() method) and as a 
sub-template/view component in another template.

* Sub-templates: when a template is used as a view in another template, it 
should be possible to mark the portion(s) of the template to be used as a 
subtemplate.  In other words, if you have a template page that contains a 
head and body, you could mark the body as being the subtemplate, so in the 
case where you use that template as a view in another template, the head, 
doctype, etc. will not be included in the containing template, which 
presumably has its own copies of these things.  (Note that for this to work 
right, the subtemplate will either need to carry its own xmlns declarations 
(if any), or those declarations will have to match the containing template's.)

* Tag building interface: tag components will have a tag-building 
interface, so that the parser can tell them about their contents.  This 
interface will probably be adaptable to I_SOXNode_NS or an extension 
thereof, so we can reuse some of our existing XML parsing 
infrastructure.  The interface will also need to have an 'addPattern(name, 
component)' method that accepts the contained tags with pattern="name" 
attributes, and a way to determine whether the tag is entirely static (i.e. 
contains no dynamic content), and if so, retrieve the static XML for that tag.
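
To make the builder idea a bit more concrete, here is a minimal sketch of a 
tag that accepts children and patterns and can report its static XML.  The 
class names StaticTag and DynamicTag are hypothetical, and so is the choice 
to keep patterns out of the static text; only addChild, addPattern, and 
staticText come from the design above.

```python
from xml.sax.saxutils import escape, quoteattr


class StaticTag:
    """Hypothetical tag component that the parser fills in piece by piece."""

    def __init__(self, name, attrs=()):
        self.name = name
        self.attrs = list(attrs)   # sequence of (attribute name, value) pairs
        self.children = []         # plain strings or other tag components
        self.patterns = {}         # pattern name -> component

    def addChild(self, node):
        self.children.append(node)

    def addPattern(self, name, component):
        # Assumption: patterns are stored aside and are not rendered in place.
        self.patterns[name] = component

    @property
    def staticText(self):
        """Return the static XML for this tag, or None if anything is dynamic."""
        parts = []
        for child in self.children:
            text = escape(child) if isinstance(child, str) else child.staticText
            if text is None:
                return None        # one dynamic child makes the whole tag dynamic
            parts.append(text)
        attrs = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in self.attrs)
        return "<%s%s>%s</%s>" % (self.name, attrs, "".join(parts), self.name)


class DynamicTag(StaticTag):
    """A tag whose output depends on runtime data, so it has no static XML."""
    staticText = None
```

A containing tag built this way only has to check whether staticText is None 
to decide if it can collapse its whole subtree into one string.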

* Supplying context/properties between tags: if we simply invoke a template 
as a view in another template, it will inherit all its properties from the 
originating template, rather than from where it is invoked.  I'm not sure 
if this is really an issue, or whether it will produce any unexpected 
results.  In effect, it means that a view can only "offer" components or 
values to views that are physically in the same page, unless templates used 
as views "copy" themselves into the page that uses them.  In addition to 
lexical context like this, tags may want to supply runtime context data to 
child components.

I'll have to consider both of these issues further to design the view 
invocation interface.  Probably, the solution for dynamic context is to 
allow passing an "execution context" component from one tag to another at 
runtime, using configuration properties to pass values.  If a tag has 
nothing to add to the property namespace, it just passes the current one on 
to its children.  I'll treat lexical context of embedded views as a 
STASCTAP YAGNI for the framework, because you can always write a custom 
view class, and passing lexical properties across templates means you can't 
understand a template without understanding the place it's invoked from, 
anyway.
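
A minimal sketch of the "pass the current one on" idea, assuming a simple 
chained lookup rather than PEAK's real configuration property machinery. 
ExecutionContext, lookup(), and child_context() are names invented for this 
example.

```python
class ExecutionContext:
    """Hypothetical runtime context: a property namespace with a parent link."""

    def __init__(self, parent=None, **props):
        self.parent = parent
        self.props = props

    def lookup(self, name, default=None):
        # Walk outward from the innermost context until the name is found.
        ctx = self
        while ctx is not None:
            if name in ctx.props:
                return ctx.props[name]
            ctx = ctx.parent
        return default


def child_context(ctx, **added):
    """Return ctx unchanged if a tag adds nothing, else a nested context."""
    return ExecutionContext(ctx, **added) if added else ctx
```

A tag with nothing to add calls child_context(ctx) and hands the very same 
object to its children, so the common case costs no allocation.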

* Parser options: the parser will need to know a number of things:

1. Are we parsing HTML?  (if so, certain tags need to be told that they 
should be rendered empty, e.g. HR, BR, etc.)

2. Will an XML namespace be required on view/model/pattern 
attributes?  (this will be optional, for convenience/brevity.)

3. Should the document be rendered with, or without, the added template 
markup?  (This changes how attribute data is supplied to the tag components.)

4. What component is the parent of the top-level document?

#1 can probably be guessed from the DOCTYPE and/or xmlns declarations.  #2 
can probably be guessed by the absence of an xmlns declaration for our 
namespace URI.  #4 is going to get passed in anyway.  #3 is hard and maybe 
a YAGNI anyway.
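
The guessing for #1 and #2 might look something like the sketch below.  The 
TEMPLATE_NS value is a placeholder, since no namespace URI has been chosen 
yet, and guess_options() is a hypothetical helper that takes the DOCTYPE 
name and the root element's attributes as already reported by the parser.

```python
XHTML_NS = "http://www.w3.org/1999/xhtml"
TEMPLATE_NS = "http://example.invalid/template"  # placeholder, not a real URI


def guess_options(doctype_name, root_attrs):
    """Guess parser options from the DOCTYPE name and root xmlns attributes."""
    # Collect every namespace URI declared on the root element.
    declared = {
        value for key, value in root_attrs.items()
        if key == "xmlns" or key.startswith("xmlns:")
    }
    # #1: treat the document as HTML if the DOCTYPE or an xmlns says so.
    is_html = (doctype_name or "").lower() == "html" or XHTML_NS in declared
    # #2: require namespaced attributes only if our namespace was declared.
    return {"html": is_html, "namespaced_attrs": TEMPLATE_NS in declared}
```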

Probably what we should do is tell the tags about the added markup, because 
it's really up to the tag object *how* to roundtrip the markup.  At some 
point, perhaps we'll add a property that tags can use to determine whether 
they should include the roundtrip data in the output.  Actually, this might 
make an interesting test/demo for skins, since one could create a 
'roundtrip' skin that sets the property to include the extra markup.

Okay, I think that's enough for now.  So it looks like the interface to-do 
list is:

* Tag as builder of a subdocument: addPattern(name,node), addChild(node)

* View (i.e., tag factory):  __call__(parent, tagName=, attribItems=, 
viewProperty=None, modelPath=None, patternName=None, nonEmpty=False) -- 
is there anything else that should be there?  XML Namespace mapping 
information, maybe?  Source file/line/column?  That'd help with debugging; 
maybe tags should set __traceback_info__ to include that info during execution.

* Tag as document node: 'staticText' attribute (=None if dynamic)

* Tag as behavior: render(interaction), which calls...

* Tag as subtemplate: renderTo(interaction, writeFunc, currentModel, 
executionContext)

And the concrete classes will need to include Tag, Text, and Literal, where 
Text is plain text (and thus needs escaping/entity encoding), and Literal 
is used to represent doctype, comments, processing instructions, DTD 
definitions, etc.  The parser will create Tag, Text, or Literal instances 
for everything that isn't marked with a 'view' attribute; for those, the 
view name will be looked up as a property on the enclosing tag to retrieve 
a view (tag factory) object.  It may be that we will want to support 
adapting various constant types (e.g. strings, numbers, etc.) to tag 
factories, so that skin-supplied property values can be embedded in a 
template at compile time.  But that's optional.
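
Here is a bare-bones sketch of the three concrete classes, mostly to show 
the difference between Text (escaped) and Literal (passed through as-is), 
and how render() can be layered on renderTo().  Attribute handling, views, 
and patterns are left out, and passing None for the model and execution 
context in render() is just an assumption for the example.

```python
from xml.sax.saxutils import escape


class Text:
    """Plain text: escaped once at build time, then written as-is."""

    def __init__(self, data):
        self.staticText = escape(data)

    def renderTo(self, interaction, writeFunc, currentModel, executionContext):
        writeFunc(self.staticText)


class Literal:
    """Doctype, comment, PI, etc.: already markup, so never escaped."""

    def __init__(self, xml):
        self.staticText = xml

    def renderTo(self, interaction, writeFunc, currentModel, executionContext):
        writeFunc(self.staticText)


class Tag:
    """Element that renders its children between open and close tags."""

    def __init__(self, name):
        self.name = name
        self.children = []

    def addChild(self, node):
        self.children.append(node)

    def renderTo(self, interaction, writeFunc, currentModel, executionContext):
        writeFunc("<%s>" % self.name)
        for child in self.children:
            child.renderTo(interaction, writeFunc, currentModel, executionContext)
        writeFunc("</%s>" % self.name)

    def render(self, interaction):
        # Tag as behavior: collect the output of renderTo() into one string.
        chunks = []
        self.renderTo(interaction, chunks.append, None, None)
        return "".join(chunks)
```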

Rather than hardcode even the Tag, Text, and Literal types, they should 
probably be looked up as properties too, although this would probably be 
quite slow.  What we could do instead is define required attributes 
'tagFactory', 'textFactory', and 'literalFactory' on the "tag as 
subdocument builder" interface.  These could map to properties, but would 
cache the lookup for subsequent and nested textual units, thus avoiding 
huge numbers of property lookups going all the way up to the config 
root.  Anyway, this would let a view redefine how contained text and 
literals were turned into objects.
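
A sketch of the caching idea, assuming the lookup simply delegates to the 
enclosing builder rather than to real configuration properties.  The 
Builder class and its constructor are hypothetical; only the three factory 
attribute names come from the text above.

```python
class Builder:
    """Hypothetical subdocument builder with cached factory lookups."""

    FACTORY_NAMES = ("tagFactory", "textFactory", "literalFactory")

    def __init__(self, parent=None, **overrides):
        self.parent = parent
        # A view can redefine any factory by supplying it directly.
        self.__dict__.update(overrides)

    def __getattr__(self, name):
        # Only reached when the attribute is not already set on this instance.
        if name in self.FACTORY_NAMES and self.parent is not None:
            value = getattr(self.parent, name)
            setattr(self, name, value)  # cache so later lookups stop here
            return value
        raise AttributeError(name)
```

After the first lookup, each nested builder holds its own reference to the 
factory, so later text and literal nodes never walk the parent chain again.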