[PEAK] PEAK-Rules progress & plans

Phillip J. Eby pje at telecommunity.com
Sat Jun 17 20:00:47 EDT 2006


At 01:28 AM 6/1/2006 -0400, Phillip J. Eby wrote:
>It's too bad Python bytecode doesn't have any kind of "computed jump" 
>feature, otherwise such tables could be built right into the bytecode.

Actually, it turns out that Python bytecode *does* have a feature that can 
be abused to perform repeated jumps:

http://mail.python.org/pipermail/python-dev/2006-June/066115.html

The demo code, using the latest version of my BytecodeAssembler package, is 
here:

http://peak.telecommunity.com/DevCenter/BytecodeAssembler#demo-computed-goto-switch-statement

Unfortunately, the technique is not currently compatible with Psyco or 
PyPy, although Armin Rigo suggests that Psyco might be fixable to support 
the trick:

http://mail.python.org/pipermail/python-dev/2006-June/066133.html

For PyPy, it would probably be better to add a JUMP_TOP opcode or something 
similar, although that would require getting approval for some kind of 
"switch" statement in Python 2.6.

I would hope, however, that the trick can be accommodated in Psyco for 
older versions of Python, as it would essentially allow a generic 
function's dispatch table to be JIT-compiled.

Anyway, this development means that I don't really need to worry about 
using either recursive function calls or binary search trees for dispatch 
nodes that are best expressed as N-way lookups in a hash table.
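
To illustrate what I mean (a hypothetical sketch, not the demo code 
linked above), such a dispatch node reduces to a single hash lookup 
instead of a chain of comparisons:

    # Hypothetical sketch: an N-way dispatch node as one hash lookup,
    # instead of a binary search tree or recursive function calls.
    def make_node(cases, default):
        def dispatch(key):
            return cases.get(key, default)   # one O(1) lookup
        return dispatch

    node = make_node({int: "int rules", str: "str rules"}, "fallback")
    assert node(int) == "int rules"
    assert node(float) == "fallback"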

As a result, I can focus my attention on other aspects of the dispatch 
algorithm, such as choosing expressions to dispatch on:


>The main parts remaining that I'd like to optimize have to do with the 
>laziness of index generation and especially the computation of the "best" 
>index for dispatching.  RuleDispatch goes through a lot of convolutions to 
>optimize the indices so that they can quickly compute their "goodness" for 
>an arbitrary collection of active rules.  However, I wonder if it wouldn't 
>be easier and/or simpler to just recompute indexes on each subtree 
>buildout, especially if the indexes' "goodness" measurements were 
>conducted using a sampling approach.

After thinking about this some more, it seems to me that recomputing the 
indexes on sub-dispatches doesn't make sense, as it's rather more complex 
than just updating the indexes eagerly the way RuleDispatch does.

By the same token, I'm wondering how much less than perfectly optimal a 
dispatch tree can be, before the dispatching cost exceeds the cost of 
building a more efficient tree.  RuleDispatch recomputes the "goodness" of 
each index as a candidate for dispatching at *every* dispatch node, and 
even using the wacky indexing techniques that I came up with, this is an 
O(n*m) operation, where n is the number of cases remaining at that node, 
and m is the number of possible dispatch candidates.  The incremental cost 
of building out those nodes is thus fairly high, albeit very lazy.
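
As a rough sketch of where the O(n*m) comes from (hypothetical names and 
scoring, not RuleDispatch's actual code):

    # Picking the best of m candidate indexes at a node with n active
    # cases: any "goodness" score has to touch every case, so the
    # selection is O(n*m).
    class Index:
        def __init__(self, name, cases):
            self.name, self.cases = name, set(cases)

    def best_index(candidate_indexes, active_cases):
        def score(index):
            # stand-in metric; a real "goodness" measure would be fancier
            return sum(1 for case in active_cases if case in index.cases)
        return max(candidate_indexes, key=score)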

However, I'm not sure it's necessary to be so particular about which 
indexes are used, because small differences in the "goodness" of indexes 
aren't all that important once you get past the first lookup.

Also, the more I've thought about the actual use cases, the more it seems 
to me that you can divide dispatch subtrees into one of two types: "wide" and 
"deep".  A wide subtree is one that has some indexes which are dramatically 
better than others for that point in the dispatch.  For example, a generic 
function that dispatches on twenty classes in one argument position, and 
then has a bunch of other odds and ends tested elsewhere, is "wide" because 
that one index is way ahead of anything else.

A "deep" subtree, on the other hand, has lots of possible choices for what 
to test on, and none of them are significantly better than the others.  The 
same generic function that dispatches on twenty classes, once you've picked 
*which* class, will then not have anything very obviously good to switch 
on.  Such subtrees are probably more suited to writing out a simple if-then 
tree, rather than trying to do any kind of indexing, as the indexing will 
probably cost more.
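
For instance (a sketch of the shape, not actual generated output), a deep 
subtree might boil down to a plain if-then chain:

    # Hypothetical "deep" subtree written out as an if-then tree; with
    # only a few remaining cases, building and consulting a hash index
    # would cost more than these direct tests.
    def handle_stream(ob):  return "stream"
    def handle_mapping(ob): return "mapping"
    def handle_default(ob): return "default"

    def deep_dispatch(ob):
        if hasattr(ob, 'read'):
            return handle_stream(ob)
        elif isinstance(ob, dict):
            return handle_mapping(ob)
        else:
            return handle_default(ob)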

The children of a "wide" subtree may be either wide or deep, but the 
children of a deep subtree are also deep.  This characteristic suggests 
that the best overall way to implement a generic function dispatch tree is 
to have one level of indexed "wide" dispatching that does all of the "good" 
lookups to find a child function that it then calls to do the "deep" 
dispatching, if any.
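
In sketch form (hypothetical names throughout), that structure is one 
table lookup whose values are child functions:

    # Two-level dispatch: a "wide" indexed lookup whose result is a
    # child function that performs any remaining "deep" tests.
    def child_for_int(arg):
        return "small int" if arg < 10 else "big int"   # the deep part

    def child_for_str(arg):
        return "empty str" if not arg else "str"

    wide_table = {int: child_for_int, str: child_for_str}

    def dispatch(arg):
        return wide_table[type(arg)](arg)   # wide lookup, then deep call

    assert dispatch(3) == "small int"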

This top-level dispatching can be made "wider" by using combined indexes -- 
for example, the PEAK-Rules core dispatcher currently works by doing a 
single "wide" lookup on the types of all its arguments against a 
cache.  This approach could be extended to any cacheable lookup keys, and 
it's also something that could be explicitly hinted by the programmer for 
performance if necessary.
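
A minimal sketch of such a combined lookup, keying one cache on the types 
of all the arguments at once (build_method_for is a hypothetical stand-in 
for the lazy buildout):

    # Combined "wide" lookup: a single cache keyed on the tuple of all
    # the arguments' types.
    cache = {}

    def build_method_for(key):
        # stand-in for building the appropriate dispatch subtree
        return lambda *args: ("method for", key)

    def dispatch(*args):
        key = tuple(type(a) for a in args)
        try:
            method = cache[key]
        except KeyError:
            method = cache[key] = build_method_for(key)
        return method(*args)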

Identifying whether a particular subtree is wide or deep is an interesting 
problem.  When the best available dispatching index has only two possible 
branches, it's quite clear that you've reached a "deep" situation.  However, 
it's probably better to look at it in terms of the average number of 
remaining cases on all branches of the subtree.  The nearer this average is 
to 1, the "wider" the subtree/index.  The nearer it is to the total number 
of remaining cases, the "deeper" it is.  And if it is around N/2, then it 
is definitely deep.
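
Expressed as code (a sketch, assuming an index is just a mapping from 
branch keys to the cases remaining on each branch):

    # Depth measure: the average number of cases left per branch.
    # Near 1 => wide; near N/2 or more => deep.
    def depth(index):
        branches = index.values()
        return sum(len(cases) for cases in branches) / float(len(branches))

    wide = {'a': [1], 'b': [2], 'c': [3]}       # depth == 1.0
    deep = {'x': [1, 2, 3], 'y': [2, 3, 4]}     # depth == 3.0
    assert depth(wide) == 1.0 and depth(deep) == 3.0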

If we look at this once, at the root of the dispatch tree, we can walk 
down the indexes in order of this goodness value, pick a threshold that 
divides the wide indexes from the deep ones, eagerly create high-level 
lookup tables for the wide indexes, and lazily create functions that 
encode the corresponding "deep" subtrees.
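
A sketch of that division step, reusing the depth() measure above and a 
purely hypothetical threshold value:

    # Order indexes by depth, then split them at a threshold into
    # "wide" (eagerly tabled) and "deep" (lazily coded as functions).
    def depth(index):   # as in the previous sketch
        return sum(len(c) for c in index.values()) / float(len(index))

    def split_indexes(indexes, threshold=2.0):
        ranked = sorted(indexes, key=depth)
        wide = [ix for ix in ranked if depth(ix) <= threshold]
        deep = [ix for ix in ranked if depth(ix) > threshold]
        return wide, deep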

Up to some maximum number of methods, it probably also makes sense to 
include a "wide" dispatch in a single function.  If a function has 10 
methods and one wide index, it probably doesn't make sense to split the 
dispatch into two levels.  As an index's depth approaches its minimum of 
1, it makes more sense to just go ahead and code out its subtrees in-line, 
even if it is fairly wide.

So, to clarify my terms a bit, let's say that an index's "depth" is the 
average number of cases that will be remaining as possibilities once a 
lookup has been done.  If this depth is 1 (or less, although it can't 
actually fall to zero), then the index is as shallow as it can possibly 
be.  If this depth is N/2 or greater, then the index is extremely deep.

Now, if we also measure how "wide" an index is by how many *distinct* 
choices it has for branching, then we can determine its "area" by 
multiplying the two.  This "area" tells us roughly how large a generated 
function would be.  If an index is wide, but its total area is too large to 
put into one function, we know we should create a top-level function that 
lazily expands subtrees as child functions.  (Note: by "distinct", I mean 
how many distinct subtrees can occur, regardless of how many keys might 
lead to a given subtree.)
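
In sketch form, counting width over *distinct* subtrees as just described 
(MAX_INLINE_AREA is a hypothetical limit, not a real PEAK-Rules constant):

    # "width" counts distinct subtrees (several keys may share one);
    # "area" = width * depth approximates generated-function size.
    def depth(index):
        return sum(len(c) for c in index.values()) / float(len(index))

    def width(index):
        return len(set(frozenset(cases) for cases in index.values()))

    def area(index):
        return width(index) * depth(index)

    MAX_INLINE_AREA = 100   # hypothetical size limit

    def needs_toplevel(index):
        return area(index) > MAX_INLINE_AREA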

The top-level function might include more than one level of dispatch, 
however.  If the next deeper index is close in "area" to the first, it may 
make sense to make the top-level dispatch be multidimensional, as is always 
done for the peak.rules.core dispatcher.  This increases the effective 
"width" (by multiplying the widths of the individual indexes) and decreases 
the effective "depth" (by having fewer choices left on each possible branch).
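
A sketch of combining two indexes into one multidimensional lookup: the 
width multiplies, and each combined branch keeps only the intersection of 
the two case sets, so the depth shrinks:

    # Merge two indexes into a single dispatch on pairs of keys.
    def combine(ix1, ix2):
        combined = {}
        for k1, cases1 in ix1.items():
            for k2, cases2 in ix2.items():
                combined[(k1, k2)] = set(cases1) & set(cases2)
        return combined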

So, to sum up the heuristics: look at all candidate indexes for the first 
dispatch, ordered by increasing depth.  Take the indexes whose area is 
"too large", combine them into an N-dimensional dispatch index, and spin 
that off into a top-level lookup function.  If there are no such indexes, 
then just write out a single, fully expanded dispatch tree.
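
Put together as a rough planning function (assuming the depth(), area(), 
and MAX_INLINE_AREA sketches above; the 'toplevel'/'inline' tags are just 
hypothetical markers):

    # Indexes whose area is too large become an N-dimensional top-level
    # lookup; otherwise, write out one fully expanded dispatch tree.
    def plan_dispatch(indexes):
        ranked = sorted(indexes, key=depth)
        oversized = [ix for ix in ranked if area(ix) > MAX_INLINE_AREA]
        if oversized:
            rest = [ix for ix in ranked if ix not in oversized]
            return ('toplevel', oversized, rest)
        return ('inline', ranked)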

Structuring the heuristics this way means that I can treat this "leveling" 
process as a feature to be added later.  I can at first just implement the 
"write it all out as bytecode" version and leave it at that.  Only when the 
functions get large enough in their total size will adding this "leveling" 
optimization be necessary.

Interestingly, however, there doesn't seem to be any way to eliminate the 
necessity of re-determining the index's depth at each new subtree, even in 
the "write it all out as bytecode" case.  However, now that I've gone 
through this analysis, I understand better why the costs of determining 
those depths aren't likely to be a problem in practice.  If you're in a 
subtree with relatively deep indexes, the chances are good that most of 
them aren't eligible candidates anyway, due to other conditions being 
prerequisites.

For example, pattern matching and parsing usually have deep conditions like 
"object is of type x and attribute y of the object is type z", where you 
can't test the second condition until the first is proven true.  So, you 
are likely only inspecting the "goodness" of a small number of candidate 
conditions to begin with, and as long as determining whether an index is 
eligible is fast, you won't be checking many indexes' applicability.  So 
the best bet for performance at this level lies not so much in making it 
fast to find an index's average depth as in making it efficient to ignore 
indexes that aren't eligible.  RuleDispatch's current approach could 
perhaps use some small improvements in that respect.
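
A sketch of what "efficient to ignore" could look like: if each index 
records its prerequisite conditions, eligibility on a branch is a cheap 
subset test against the conditions already established there (the data 
layout here is purely hypothetical):

    # Skip ineligible indexes cheaply: each index carries the set of
    # conditions it depends on; a branch tracks what's already known.
    def eligible(indexes_with_prereqs, established):
        return [name for name, prereqs in indexes_with_prereqs
                if prereqs <= established]

    idx = [('type(x) is Foo', set()),
           ('type(x.y) is Bar', set(['type(x) is Foo']))]
    assert eligible(idx, set()) == ['type(x) is Foo']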

Whew.  I think that's about all my brain can handle for today.  :)  It's 
beginning to look like RuleDispatch is pretty close to as algorithmically 
optimal as it's possible to be.  Where it's not as optimal is in the 
constant factors, like the speed of computing expressions and branching, 
and those will be easily surpassed by bytecode generation in 
PEAK-Rules.  On the plus side, this means I'll be taking more from 
RuleDispatch's indexing and tree building algorithms than I thought.  On 
the minus side, I don't see any simplifications taking place there, except 
for those that come from having simple generic functions available for the 
implementation, and from having a simpler way to encode expressions 
(courtesy of BytecodeAssembler).



