[PEAK] PEAK-Rules progress & plans

Phillip J. Eby pje at telecommunity.com
Thu Jun 1 01:28:37 EDT 2006


So, after completing the core framework last weekend, I'm thinking about 
what I might want to accomplish next weekend.  The following is a brain 
dump of some possible plans.


Expression Code Generation
--------------------------

The most likely first candidate is code generation for expressions.  The 
BytecodeAssembler package now has a rudimentary facility for generating 
code from AST-like data structures, and I'd like to expand on that a 
bit.  What I'd like to end up with is an object representing a particular 
point in the generation of a dispatch tree: one that knows what expressions 
have already been computed on the current execution path, and which 
(generated) local variables hold them.

The high-level interface should be that you just ask this object to put an 
expression value on the stack for you, and it figures out how to do 
that.  In the simplest case it would just issue a LOAD_FAST opcode for some 
existing variable name (e.g. a function argument), but in more complex 
cases it would need to calculate the value from possibly-cached values, and 
maybe cache some intermediate values along the way if those values are used 
as part of other expressions that might be needed later.

The other high-level feature needed would be the ability to "fork" the 
object and produce another one that will be used along a different code 
execution path, so that each execution path separately keeps track of what 
values have or have not been computed as yet along that path.
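
To make this concrete, here's a minimal sketch of the interface I have in 
mind.  The class and its attributes are hypothetical at this point; `code` 
stands for a BytecodeAssembler Code object, whose opcode methods 
(LOAD_FAST and friends) and node-generating call are what I'd build on:

    class ExprState:
        """Track which expressions are computed so far on one path"""

        def __init__(self, code, refs, cached=None):
            self.code = code                  # BytecodeAssembler Code object
            self.refs = refs                  # expression -> total use count
            self.cached = dict(cached or {})  # expression -> local var name

        def load(self, expr):
            # Put `expr`'s value on the stack, reusing a cached local
            # where possible, and caching values that will be used again.
            if expr in self.cached:
                self.code.LOAD_FAST(self.cached[expr])
                return
            self.code(expr)                   # compute it inline
            if self.refs.get(expr, 0) > 1:    # needed again later?
                name = 'expr_%d' % len(self.cached)  # placeholder naming
                self.code.DUP_TOP()
                self.code.STORE_FAST(name)
                self.cached[expr] = name

        def fork(self):
            # A copy for a different execution path; from here on, each
            # path separately tracks what has been computed.
            return ExprState(self.code, self.refs, self.cached)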

Creating such an object isn't really a big deal, though.  Most of the real 
work is in managing the external data structure it needs in order to know 
whether a particular subexpression is going to be reused along a particular 
pathway, and thus whether caching that value is required.

The other interesting bit is going to be porting the existing expression 
data structures from RuleDispatch.  RuleDispatch uses an approach that's 
based on representing nearly everything imaginable as a function 
call.  This is helpful for constant-folding, because it can just invoke the 
actual function if all its arguments are constants.

But it's not as useful for code generation per se; sure, you can turn 
getattr(x,y) back into x.y, but it almost seems to be adding an unnecessary 
layer.  On the other hand, it doesn't require very many node types to 
represent things: RuleDispatch has only 6 (besides Argument): OrExpr, 
AndExpr, Tuple, Getattr, Const, and Call.  But if I took this approach with 
PEAK-Rules, I could actually whittle it down to just Const and Call.  I 
think I like that better.  In fact, with so few node types, I could 
probably embed a lot of it into BytecodeAssembler itself.
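
For illustration, with toy stand-ins for the node types (not the final 
API), x.y can be modeled as a call, and the constant folding described 
above falls out almost for free:

    class Const:
        def __init__(self, value): self.value = value

    class Arg:
        def __init__(self, name): self.name = name

    class Call:
        def __init__(self, func, args): self.func, self.args = func, args

    def fold_call(func, args):
        # Folding constructor: if the function and all of its arguments
        # are constants, just invoke the function now.
        for node in [func] + list(args):
            if not isinstance(node, Const):
                return Call(func, args)        # can't fold; keep symbolic
        return Const(func.value(*[a.value for a in args]))

    # x.y stays symbolic, because Arg('x') isn't a constant...
    expr = fold_call(Const(getattr), (Arg('x'), Const('y')))   # a Call
    # ...but an all-constant expression folds immediately:
    three = fold_call(Const(len), (Const('abc'),))             # Const(3)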

Okay, I think I'm sold.  If it goes into BytecodeAssembler, though, there 
needs to be a way to avoid constant folding if you don't want it.  But 
that's easy enough to do by having folding and non-folding constructors for 
the Call type.  All of these node types will need to be hashable and 
comparable by value, not just identity, as PEAK-Rules will need to be able 
to use them as dictionary keys.  But it occurs to me that I could hack 
something together for that using instancemethod objects to curry 
functions.  It would be a little ugly when trying to e.g. prettyprint an 
expression, but that could be done with a specialized function for the purpose.

I was originally thinking of using structType from peak.util.Struct for 
this purpose, and splitting it out into a StructType project, but I'd 
rather not if I don't really need to, especially since it would be a 
dependency for BytecodeAssembler in that case.  Instancemethod objects are 
hashable and comparable by value, so it ought to work.  Complex expressions 
will take longer to compare and hash, but worrying about that now would be 
premature optimization.
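
Here's the trick in miniature.  The node layout is made up, but the 
instancemethod behavior is real: bound methods hash and compare by their 
im_func and im_self, so currying the node data onto a generating function 
gives value comparison for free:

    from types import MethodType     # the instancemethod type

    def gen_call(data, code):
        # Would emit bytecode for the call; `data` is the curried
        # (func, args) pair and `code` the assembler to emit into.
        pass

    def CallNode(func, args):
        # The node *is* a bound method, with the data tuple as im_self.
        return MethodType(gen_call, (func, args))

    a = CallNode('getattr', ('x', 'y'))
    b = CallNode('getattr', ('x', 'y'))
    assert a == b and hash(a) == hash(b)    # safe to use as dict keys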

Okay, so I think that covers the code generation front pretty well.  I'll 
add some ability to inline builtins (e.g. getattr) and to do constant 
folding into BytecodeAssembler, using currying instead of closures in the 
current Call, Const, and Global node types available there, so that 
expression nodes are hashable and comparable.

The expression "state manager" will go into PEAK-Rules itself.  The main 
open question there is how it should generate variable names for 
intermediate expressions.  An obvious choice might be to "decompile" an 
expression and use that string as the variable name, which would make it 
quite clear what was going on when reading a disassembly or viewing a frame 
in a debugger.  Most rule subexpressions aren't terribly complicated (e.g. 
'isinstance(x,Foo)') so this is probably not such a bad idea.  OTOH, 
representing types in such strings would be a pain, and the string size 
could also blow up pretty quickly.

So, tentatively, I'm thinking the variable names could be something like 
'?1: isinstance(x,?2)', where subexpressions are given numbers, but the 
full variable name includes a simple non-nested representation using the 
numbers to indicate subexpressions.  So, any subexpression that appears 
more than once gets its own name, and that name includes only unique 
information that's not in other variables' names.

The only downside to this approach is that tests might end up with 
unstably-assigned variable numbers, unless the algorithm that assigns the 
numbers is sufficiently deterministic.  But that is probably easy enough to 
deal with.  I can wait and see in practice if there are any issues there.
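
Roughly, and with the expression attributes (.args, .name, and a .source 
display form) assumed, the naming could work like this; the numbering is 
deterministic as long as the input ordering is:

    def assign_names(cacheable_exprs):
        # Number the cacheable subexpressions in a fixed order, then
        # build one-level-deep names like '?1: isinstance(x,?2)'.
        numbers = {}
        for expr in cacheable_exprs:           # assumed stable ordering
            numbers[expr] = len(numbers) + 1

        def shallow(sub):
            # Nested cacheable parts appear only as their '?N' number.
            if sub in numbers:
                return '?%d' % numbers[sub]
            return sub.source                  # hypothetical display form

        names = {}
        for expr in cacheable_exprs:
            args = ','.join([shallow(a) for a in expr.args])
            names[expr] = '?%d: %s(%s)' % (numbers[expr], expr.name, args)
        return names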

So the state manager facility will need to accept a list of known 
expressions at creation time, and to be able to walk them to find out 
what subexpressions they have, so it can compute overall reference counts 
to decide what subexpressions are cacheable and give them variable 
names.  It will also need a way to decrease a subexpression's reference 
count, in order to avoid caching values on branches where those values will 
not actually end up being used.
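
The reference counting itself is simple; assuming each node exposes its 
subexpressions as .args, it might amount to:

    def count_refs(known_exprs):
        # Total the uses of every subexpression across all expressions;
        # anything used more than once is a caching candidate.
        refs = {}
        def walk(expr):
            refs[expr] = refs.get(expr, 0) + 1
            for sub in getattr(expr, 'args', ()):
                walk(sub)
        for expr in known_exprs:
            walk(expr)
        return refs

    def discount(refs, expr):
        # For branches that rule `expr` out, so its value isn't cached
        # on paths where it can never be used.
        refs[expr] -= 1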


Dispatch Tree Generation
------------------------

I don't think I can actually start on dispatch tree generation yet, even 
after the expression code generation is done.  Although I'll have a 
notation for expressions to be tested, there aren't any data structures yet 
to represent expression-based signatures or predicates thereof, let alone 
dispatch indexes.  Nonetheless I have a broad idea of what the code 
generation for dispatch trees will look like, so I'm going to dump out my 
thoughts so far.

I actually have been contemplating several alternative ways to do dispatch 
tree code generation.  Some are useful mainly for very large rulesets that 
are well-segmented by certain parameters.  For example, a generic function 
to select "views" in Zope or peak.web is likely to be well-segmented by 
attribute name, such that the extra cost of doing a dictionary lookup and a 
second level of function call could easily pay off in performance, if there 
are dozens of different names and each name has many possible classes with 
registered views.

On the other hand, practical experience to date suggests that most useful 
generic functions in real programs are actually quite small, perhaps with 
only a few dozen methods in all, if even that many.  For these functions, 
other dispatch algorithms would perform much better.  For example, I 
estimate the dispatch overhead of PEAK-Rules' current default engine to be 
about 2 microseconds of static overhead plus about 1 microsecond 
per argument.  A given invocation may take longer, if method combination 
has to occur, but since each type signature is cached when that happens, 
the function just gets faster on each call.

Meanwhile, if the function as a whole has a small enough number of methods 
that the entire dispatch tree can be generated as inline code, the 
resulting performance should be equal to or better than a hand-written 
dispatch function, except for the overhead required to call out from the 
dispatcher to the actual method, and any overhead imposed by any combined 
methods.

What's more, any dispatch mechanism that uses lookup tables to speed up the 
lookup process is going to want to "bottom out" to a pure-bytecode 
dispatcher at some point, as soon as there are a small enough number of 
candidate methods remaining.  So it makes sense that the next engine I 
implement should be the bytecode-based one, leaving table-driven dispatch 
as an optimization that kicks in when some heuristic says it should.

It wouldn't need to be a specialized engine, even.  The algorithm I have in 
mind for tree building would follow roughly the same lines as the one in 
RuleDispatch; it's just that the generated code would differ depending on 
the type of index and how many entries that index had.  In RuleDispatch, 
the indexes determined what node type object would be used, so this isn't 
substantially different, except that here the indexes will be responsible 
for generating code, too.  Or more precisely, there will be generic 
functions that generate code based on the index type and engine type, 
allowing different engine types to customize certain aspects of the code 
generation process.

The basic algorithm for tree building is to build indexes for each tested 
expression covered by the current rules, then find out which index is 
"best" to dispatch on next, and build a dispatch node.  Dispatching then 
fans out from that point, using a memoization dictionary to merge any 
branches of the tree that would end up being identical.  At each branch, 
the set of "current" rules is trimmed to exclude anything that's ruled out 
by that branch, so the current set gradually dwindles to nothing, as do the 
currently active indexes (since any expression already tested at a higher 
level of the tree no longer needs to be tracked).

In the new system, this walk will be done against a bytecode object being 
generated, with the expressions to test being generated inline (as 
described in the previous section above).  Each index would then generate 
code to test the expression's value against its possible values and branch 
to the appropriate subtrees.  Repeat recursively, and all is well.
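
Condensed to its skeleton, and with every helper name made up, the 
buildout would look something like:

    def build_tree(rules, indexes, memo):
        key = frozenset(rules)
        if key in memo:
            return memo[key]             # merge identical branches
        if not rules or not indexes:
            node = make_leaf(rules)      # combine the remaining methods
        else:
            best = choose_best(indexes, rules)
            branches = {}
            for value in best.possible_values():
                branches[value] = build_tree(
                    trim(rules, best, value),      # drop excluded rules
                    remove_index(indexes, best),   # this expr is now tested
                    memo)
            node = make_dispatch_node(best, branches)
        memo[key] = node
        return node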

Of course, I'm leaving out all the hard bits with hand-waving.  One of the 
most annoyingly complicated bits in RuleDispatch is how hard it is to find 
which index is "best" to dispatch on next.  There are several steps that 
have to be done.

First, you can only use indexes on *enabled* expressions.  An enabled 
expression is one that either appears in the first position of a signature 
(after removing already-dispatched expressions) or is an actual argument to 
the generic function.  The reason for this is to support things like "x!=0 
and y/x>3", where the expression on the right can't be safely computed 
until the one on the left has been determined to be true.
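
As a sketch, with signatures assumed to be sequences of tests in 
left-to-right order:

    def enabled_expressions(signatures, arguments, dispatched):
        # Plain arguments are always safe to test; anything else is
        # enabled only while it's the first not-yet-dispatched test of
        # some signature.  So 'y/x>3' stays disabled until 'x!=0' has
        # been dispatched on.
        result = set(arguments)
        for sig in signatures:
            for expr in sig:
                if expr not in dispatched:
                    result.add(expr)
                    break                # only the leftmost untested one
        return result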

Once you've cut it down to enabled expressions, you have to determine which 
index has the smallest average number of applicable rules for the branches 
it would generate.  The purpose of this is to rule out "bad" dispatch 
expressions that don't narrow things down very much.  If you have one 
parameter that's a boolean value, and another parameter that's one of fifty 
different strings, it's pretty clear that you should test the string 
equality first, as it's going to cut down on the number of other tests you 
have to do to find the applicable methods.
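
The measure itself is trivial once indexes can report the rules applicable 
down each of their branches; something like:

    def average_branch_size(index):
        # Lower is better.  A boolean test scores badly because each of
        # its two branches still leaves about half the rules applicable;
        # a fifty-way string test leaves only a sliver per branch.
        sizes = [len(rules) for rules in index.branches()]   # assumed API
        return float(sum(sizes)) / len(sizes)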

RuleDispatch does this by keeping its indexes active and up to date from 
the moment you add a rule to a generic function.  I'm not sure I want to do 
that for PEAK-Rules, because it slows down definition and there's a very 
real possibility that you will never call the generic function even once.

More precisely, the issue is that RuleDispatch very eagerly computes a 
complete dispatch index, taking into consideration all forms of 
inter-criterial implication.  The code to do this has to be very complex in 
order to perform well, so I'd like to take a less-eager approach in 
PEAK-Rules if possible.  For example, it's possible that I could compute a 
simpler statistic as a rough estimate of how "good" an index is for 
dispatching, and then only compute the "real" statistic once it is needed.

Another possibility is that I could lazily generate code, in a manner 
similar to the old PC game "Animal".  In "Animal" the computer asks you a 
series of yes-no questions to guess what animal you are thinking of.  If it 
doesn't know your animal, it asks you what question would distinguish the 
animal it guessed from the animal you were thinking of, and thus it 
"learns" to build a decision tree.

What I'm wondering is if I could do the same thing here.  Initially just 
generate a stub function that calls back to the engine saying "help, I 
don't know what to do".  The engine would examine the *current* arguments, 
figure out what methods apply, and then expand the tree by adding the tests 
that it had to use to determine which methods applied.  Any branches for 
the "paths not taken" would get callback stubs to repeat this process.

So, each time the function was called, it would "learn" what questions to 
ask, and only the useful parts of the tree would be built.  It would be 
slower to answer things it had never answered before, of course, and a big 
potential downside is that it might start taking longer and longer to 
regenerate the bytecode.  On the other hand, I think I may see a way to 
redirect the jumps in an existing Code object so the entire tree wouldn't 
be rebuilt, just incrementally expanded.  The bytes would still have to be 
copied, but it wouldn't be nearly so bad.

On the surface, this approach sounds really cool, especially since it 
closely resembles what Psyco does to generate type-specific versions of a 
function.  However, it doesn't leave as well-defined a place for 
table-based dispatches to occur.  What would happen is that you'd get long 
runs of unbalanced subtrees that effectively become linear searches.

It's too bad Python bytecode doesn't have any kind of "computed jump" 
feature; otherwise such tables could be built right into the 
bytecode.  However, Python *can* do a dictionary get() plus about 6 integer 
'<' operations in less than a microsecond, meaning a binary search through 
64 possibilities after looking up a target value in a dictionary is quite 
competitive when compared to a lookup and a function call.
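
As ordinary Python standing in for the generated bytecode, a small version 
of that pattern looks like this (the names and values are invented for the 
example):

    targets = {'index': 0, 'view': 1, 'edit': 2, 'delete': 3}

    def handle_index(arg):   return 'index', arg
    def handle_view(arg):    return 'view', arg
    def handle_edit(arg):    return 'edit', arg
    def handle_delete(arg):  return 'delete', arg
    def handle_default(arg): return 'default', arg

    def dispatch(value, arg):
        # One dict lookup maps the tested value to a small integer, then
        # ~log2(N) integer comparisons pick the branch.
        n = targets.get(value, 4)        # 4 = no specific match
        if n < 2:
            if n < 1:
                return handle_index(arg)
            return handle_view(arg)
        if n < 4:
            if n < 3:
                return handle_edit(arg)
            return handle_delete(arg)
        return handle_default(arg)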

So we might not even need a special optimization for table-based dispatch, 
since N-way dispatch could be implemented with binary trees.  Since it 
takes a *lot* of rules to produce even 64 dispatch-relevant values for a 
single expression (e.g. 64 different values that different rules test 
against the same argument expression), this should actually be very 
competitive as a dispatch technique.  A tree of 32 values or fewer should 
be another 0.1 microseconds faster, and a tree of 128 values only another 
0.1 microseconds slower.

The tree can even be lazily expanded, at the cost of more memory overhead 
to keep track of where a given subtree was "left off", but it wouldn't be 
much different from the same overhead that RuleDispatch has for such 
lazily-expanded subtrees now.  Probably the sensible thing to do is to 
expand a subtree if its total size is likely to be less than the memory 
required to remember how to expand it.  Which means that laziness is really 
just a speed optimization that could be left to a later phase, after the 
eager algorithm is stable.

So, the resulting algorithm at this point seems not much different from 
RuleDispatch, except in the details.  This is good from a risk point of 
view, although it may be a bit tedious to actually implement.

The main parts remaining that I'd like to optimize have to do with the 
laziness of index generation and especially the computation of the "best" 
index for dispatching.  RuleDispatch goes through a lot of convolutions to 
optimize its indexes so that they can quickly compute their "goodness" for 
an arbitrary collection of active rules.  However, I wonder if it wouldn't 
be easier and/or simpler to just recompute indexes on each subtree 
buildout, especially if the indexes' "goodness" measurements were conducted 
using a sampling approach.

That is, if you have a lot of rules, it might be simpler to take a 
fixed-size random sampling of rules, and use those to decide what to 
dispatch on next.  My limited understanding of statistics says that the 
chance of a randomly-chosen sample being significantly different in value 
distribution from the population as a whole is very low, but meanwhile it 
would put a fixed-size cap on the maximum analysis time needed to choose a 
dispatch expression for a subtree.
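
In code, the sampling idea is nearly trivial (the scorer is the 
hypothetical per-index goodness measure from above, and 64 is an arbitrary 
illustration rather than a tuned number):

    import random

    def choose_dispatch_index(indexes, active_rules, sample_size=64):
        # Score each index against at most `sample_size` randomly chosen
        # rules, capping analysis time no matter how large the ruleset.
        rules = list(active_rules)
        if len(rules) > sample_size:
            rules = random.sample(rules, sample_size)
        best, best_score = None, None
        for index in indexes:
            score = goodness(index, rules)    # hypothetical scorer
            if best_score is None or score < best_score:
                best, best_score = index, score
        return best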

On the other hand, at least RuleDispatch's indexes are a known and 
implemented quantity, with a host of optimizations and a fair amount of use 
-- albeit a near-total lack of unit tests.  But RuleDispatch really only 
has two types of indexes, anyway: hierarchy/identity indexes and range 
indexes.  There are several actual kinds of criteria, but nearly all of 
them end up boiling down to hierarchy indexes once you factor out what 
expression they apply to and what the lookup algorithm is.

Sigh.  Unfortunately, a big part of what's left here is just plain tedious 
to implement correctly.  Viewed in broad strokes, it's all very elegant but 
the details are a pain.  When you do things like "and" criteria and "not" 
criteria applied to isinstance() tests, the table-driven approaches start 
getting messy quickly.  It might actually make more sense to render such 
criteria as literal inline isinstance() tests in some cases - perhaps even 
most cases.

Multiple inheritance is another tricky corner case, because you pretty much 
*have* to recompute indexes when you encounter a multiply-inheriting class 
(while doing type testing).  The base engine in the core framework avoids 
this problem by recomputing *everything* whenever it encounters a 
combination of types it hasn't seen before.  That seems impractical in this 
instance, however.

I guess at this point I'm really not certain how "wide" dispatches (i.e. 
ones with many possible values) should work.  Although the binary search 
thing is possible in principle, it could easily break down such that the 
whole function would have to be recoded whenever it encountered a new 
type.  (Range/equality/identity indexes and single-parent hierarchy indexes 
could avoid this problem, and it wouldn't actually occur in practice except 
on the first encounter with each type whose len(__bases__) was >1.)

It really does seem as though wide type dispatches want to happen via some 
kind of dictionary lookup, rather than via binary searching.  The 
disadvantage is that then there's function calling overhead, but it might 
be worth it.

Ah well.  I think what I'm going to have to do is just give up on all these 
optimization attempts and live with the fact that if/then trees will tend 
towards linear-search behavior as dispatch widths increase, and figure out 
a cost model for determining how long a linear search has to be before it 
costs less to do a function call, and only then make the switch.  It's also 
possible that such an optimization could stack more than one condition into 
the lookup, the way the core engine just dispatches on a tuple of all the 
types.
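
A first cut at such a cost model could be as blunt as this; both constants 
are invented here and would need to be measured, not guessed:

    COMPARE_COST = 0.15   # microseconds per integer '<' test (made up)
    CALL_COST    = 1.0    # microseconds for a lookup plus call-out (made up)

    def should_call_out(n_branches):
        # A linear search visits about half the branches on average;
        # switch to a table lookup plus function call once that's dearer.
        return (n_branches / 2.0) * COMPARE_COST > CALL_COST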

But whatever the case, I think it's pretty clear I'm not going to be able 
to get to that optimization without first implementing a more "naive" 
approach, so for now I'll focus on the basics.  I'm probably overoptimizing 
anyway, in that I'm still trying too hard here to create a single dispatch 
algorithm that's efficient in all cases.  But the only way to do that is by 
giving up the best possible performance in more specific situations, in 
favor of a "generally good" approach.

Somewhere around here I've got a writeup I did about the major use cases 
for generic functions; I should probably dig that up and make note of the 
tradeoffs involved and which algorithms are likely to work best for each.



