[PEAK] PEAK-Rules progress & plans
Phillip J. Eby
pje at telecommunity.com
Thu Jun 1 01:28:37 EDT 2006
So after completing the core framework last weekend, I'm doing some
thinking about what things I might want to accomplish this next
weekend. The following is a brain dump of some possible plans.
Expression Code Generation
--------------------------
The most likely first candidate is code generation for expressions. The
BytecodeAssembler package now has a rudimentary facility for generating
code from AST-like data structures, and I'd like to expand on that a
bit. What I'd like to end up with is an object representing a particular
point in the generation of a dispatch tree, that knows what expressions
have already been computed on the current execution path, and what
(generated) local variables hold them.
The high-level interface should be that you just ask this object to put an
expression value on the stack for you, and it figures out how to do
that. In the simplest case it would just issue a LOAD_FAST opcode for some
existing variable name (e.g. a function argument), but in more complex
cases it would need to calculate the value from possibly-cached values, and
maybe cache some intermediate values along the way if those values are used
as part of other expressions that might be needed later.
The other high-level feature needed would be the ability to "fork" the
object and produce another one that will be used along a different code
execution path, so that each execution path separately keeps track of what
values have or have not been computed as yet along that path.
Creating such an object isn't really a big deal, though. Most of the real
work is in managing the external data structure it needs in order to know
whether a particular subexpression is going to be reused along a particular
pathway, and thus whether caching that value is required.
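To make the shape of this concrete, here's a rough sketch of such an object. Everything here is illustrative rather than actual PEAK-Rules or BytecodeAssembler API: a plain list of (opcode, argument) pairs stands in for a real code object, and this toy caches every computed call, whereas the real state manager would consult reference counts to decide what's worth caching.

```python
class ExprState:
    """Per-path code generation state: tracks which expressions have
    already been computed and which generated locals hold them."""
    def __init__(self, cached=None, counter=None):
        self.cached = dict(cached or {})   # expression -> local variable name
        self.counter = counter or [0]      # shared counter for naming locals

    def fork(self):
        # A copy for a different execution path; from here on, each path
        # independently tracks what has been computed along it.
        return ExprState(self.cached, self.counter)

    def load(self, code, expr):
        """Emit code leaving expr's value on the stack."""
        if expr in self.cached:                 # already computed on this path
            code.append(('LOAD_FAST', self.cached[expr]))
        elif expr[0] == 'arg':                  # a plain function argument
            code.append(('LOAD_FAST', expr[1]))
        else:                                   # ('call', func, *args)
            for sub in expr[1:]:
                self.load(code, sub)
            code.append(('CALL', len(expr) - 2))
            self.counter[0] += 1
            name = '?%d' % self.counter[0]
            self.cached[expr] = name            # remember where it lives
            code.append(('STORE_FAST', name))
            code.append(('LOAD_FAST', name))
```

Loading the same expression a second time on the same path (or on a fork taken after it was computed) then collapses to a single LOAD_FAST of the generated local.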
The other interesting bit is going to be porting the existing expression
data structures from RuleDispatch. RuleDispatch uses an approach that's
based on representing nearly everything imaginable as a function
call. This is helpful for constant-folding, because it can just invoke the
actual function if all its arguments are constants.
But it's not as useful for code generation per se; sure you can turn
getattr(x,y) back into x.y, but it almost seems to be adding an unnecessary
layer. On the other hand, it doesn't require very many node types to
represent things: RuleDispatch has only 6 (besides Argument): OrExpr,
AndExpr, Tuple, Getattr, Const, and Call. But if I took this approach with
PEAK-Rules, I could actually whittle it down to just Const and Call. I
think I like that better. In fact, with so few node types, I could
probably embed a lot of it into BytecodeAssembler itself.
Okay, I think I'm sold. If it goes into BytecodeAssembler, though, there
needs to be a way to avoid constant folding if you don't want it. But
that's easy enough to do by having folding and non-folding constructors for
the Call type. All of these node types will need to be hashable and
comparable by value, not just identity, as PEAK-Rules will need to be able
to use them as dictionary keys. But it occurs to me that I could hack
something together for that using instancemethod objects to curry
functions. It would be a little ugly when trying to e.g. prettyprint an
expression, but that could be done with a specialized function for the purpose.
I was originally thinking of using structType from peak.util.Struct for
this purpose, and splitting it out into a StructType project, but I'd
rather not if I don't really need to, especially since it would be a
dependency for BytecodeAssembler in that case. Instancemethod objects are
hashable and comparable by value, so it ought to work. Complex expressions
will take longer to compare and hash, but it's probably a premature
optimization to worry about it.
Okay, so I think that covers the code generation front pretty well. I'll
add some ability to inline builtins (e.g. getattr) and to do constant
folding into BytecodeAssembler, using currying instead of closures in the
current Call, Const, and Global node types available there, so that
expression nodes are hashable and comparable.
The expression "state manager" will go into PEAK-Rules itself. The main
open question there is how it should generate variable names for
intermediate expressions. An obvious choice might be to "decompile" an
expression and use that string as the variable name, which would make it
quite clear what was going on when reading a disassembly or viewing a frame
in a debugger. Most rule subexpressions aren't terribly complicated (e.g.
'isinstance(x,Foo)') so this is probably not such a bad idea. OTOH,
representing types in such strings would be a pain, and the string size
could also blow up pretty quickly.
So, tentatively, I'm thinking the variable names could be something like
'?1: isinstance(x,?2)', where subexpressions are given numbers, but the
full variable name includes a simple non-nested representation using the
numbers to indicate subexpressions. So, any subexpression that appears
more than once gets its own name, and that name includes only unique
information that's not in other variables' names.
The only downside to this approach is that tests might end up with
unstably-assigned variable numbers, unless the algorithm that assigns the
numbers is sufficiently deterministic. But that is probably easy enough to
deal with. I can wait and see in practice if there are any issues there.
So the state manager facility should have a way to be created with a list
of known expressions, and it will need to be able to walk them to find out
what subexpressions they have, so it can compute overall reference counts
to decide what subexpressions are cacheable and give them variable
names. It will also need a way to decrease a subexpression's reference
count, in order to avoid caching values on branches where those values will
not actually end up being used.
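A rough sketch of that walk, assuming tuple-shaped expression nodes for illustration (the class and method names here are mine, not the eventual API):

```python
from collections import Counter

def subexpressions(node):
    """Depth-first walk over a tuple-shaped node and all its children."""
    yield node
    if isinstance(node, tuple):
        for child in node[1:]:
            yield from subexpressions(child)

class StateManager:
    """Count subexpression references across the known expressions, and
    assign '?N' names to anything referenced more than once (i.e. worth
    caching in a generated local)."""
    def __init__(self, known_exprs):
        self.counts = Counter()
        order = []                         # deterministic first-seen order
        for expr in known_exprs:
            for sub in subexpressions(expr):
                if not self.counts[sub]:
                    order.append(sub)
                self.counts[sub] += 1
        self.names = {}
        for sub in order:
            if self.counts[sub] > 1:
                self.names[sub] = '?%d' % (len(self.names) + 1)

    def release(self, expr):
        # Decrease reference counts along a branch where this expression
        # can no longer be needed, so its value isn't pointlessly cached.
        for sub in subexpressions(expr):
            self.counts[sub] -= 1
```

Assigning names in first-seen order is one easy way to keep the numbering deterministic, which addresses the test-stability worry above.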
Dispatch Tree Generation
------------------------
I don't think I can actually start on dispatch tree generation yet, even
after the expression code generation is done. Although I'll have a
notation for expressions to be tested, there aren't any data structures yet
to represent expression-based signatures or predicates thereof, let alone
dispatch indexes. Nonetheless I have a broad idea of what the code
generation for dispatch trees will look like, so I'm going to dump out my
thoughts so far.
I actually have been contemplating several alternative ways to do dispatch
tree code generation. Some are useful mainly for very large rulesets that
are well-segmented by certain parameters. For example, a generic function
to select "views" in Zope or peak.web is likely to be well-segmented by
attribute name, such that the extra cost of doing a dictionary lookup and a
second level of function call could easily pay off in performance, if there
are dozens of different names and each name has many possible classes with
registered views.
On the other hand, practical experience to date suggests that most useful
generic functions in real programs are actually quite small, perhaps with
only a few dozen methods in all, if even that many. For these functions,
other dispatch algorithms would perform much better. For example, I
estimate the current dispatch overhead of PEAK-Rules' current default
engine to be about 2 microseconds static overhead plus about 1 microsecond
per argument. A given invocation may take longer, if method combination
has to occur, but since each type signature is cached when that happens,
the function just gets faster on each call.
Meanwhile, if the function as a whole has a small enough number of methods
that the entire dispatch tree can be generated as inline code, the
resulting performance should be equal to or better than a hand-written
dispatch function, except for the overhead required to call out from the
dispatcher to the actual method, and any overhead imposed by any combined
methods.
What's more, any dispatch mechanism that uses lookup tables to speed up the
lookup process is going to want to "bottom out" to a pure-bytecode
dispatcher at some point, as soon as there are a small enough number of
candidate methods remaining. So it makes sense that the next engine I
implement should be the bytecode-based one, leaving table-driven dispatch
as an optimization that kicks in when some heuristic says it should.
It wouldn't need to be a specialized engine, even. The algorithm I have in
mind for tree building would follow roughly the same lines as the one in
RuleDispatch; it's just that the generated code would differ depending on
the type of index and how many entries that index had. In RuleDispatch,
the indexes determined what node type object would be used, so this isn't
substantially different, except that here the indexes will be responsible
for generating code, too. Or more precisely, there will be generic
functions that generate code based on the index type and engine type,
allowing different engine types to customize certain aspects of the code
generation process.
The basic algorithm for tree building is to build indexes for each tested
expression covered by the current rules, then find out which index is
"best" to dispatch on next, and build a dispatch node. Dispatching then
fans out from that point, using a memoization dictionary to merge any
branches of the tree that would end up being identical. At each branch,
the set of "current" rules is trimmed to exclude anything that's ruled out
by that branch, so the current set gradually dwindles to nothing, as do the
currently active indexes (since any expression already tested at a higher
level of the tree no longer needs to be tracked).
In the new system, this walk will be done against a bytecode object being
generated, with the expressions to test being generated inline (as
described in the previous section above). Each index would then generate
code for testing the expression's value against its possible values, and to
branch to different subtrees. Repeat recursively, and all is well.
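The recursive, memoized buildout can be sketched like so. The toy Index here does equality-only dispatch over frozensets of rule ids, standing in for real hierarchy/range indexes, and "best" is simply smallest average branch size; all names are hypothetical.

```python
class Index:
    """Toy equality-only index: maps tested values to the rule ids
    that require them."""
    def __init__(self, expr, cases):
        self.expr = expr        # the expression this index tests
        self.cases = cases      # {value: frozenset of rule ids}

    def avg_branch_size(self, rules):
        sizes = [len(rules & m) for m in self.cases.values()]
        return sum(sizes) / len(sizes)

def build_tree(rules, indexes, memo=None):
    """Pick the 'best' index, branch on its values, and recurse --
    memoizing on the surviving rule set so identical branches merge."""
    if memo is None:
        memo = {}
    key = (frozenset(rules), tuple(ix.expr for ix in indexes))
    if key in memo:
        return memo[key]
    if not rules or not indexes:
        node = ('leaf', frozenset(rules))    # combine remaining methods here
    else:
        best = min(indexes, key=lambda ix: ix.avg_branch_size(rules))
        rest = [ix for ix in indexes if ix is not best]
        branches = {value: build_tree(rules & matching, rest, memo)
                    for value, matching in best.cases.items()}
        node = ('switch', best.expr, branches)
    memo[key] = node
    return node
```

In the real thing the 'switch' nodes would emit inline value tests and jumps into a Code object rather than build tuples, but the shape of the walk is the same.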
Of course, I'm leaving out all the hard bits with hand-waving. One of the
most annoyingly complicated bits in RuleDispatch is how hard it is to find
which index is "best" to dispatch on next. There are several steps that
have to be done.
First, you can only use indexes on *enabled* expressions. An enabled
expression is one that either appears in the first position of a signature
(after removing already-dispatched expressions) or is an actual argument to
the generic function. The reason for this is to support things like "x!=0
and y/x>3", where the expression on the right can't be safely computed
until the one on the left has been determined to be true.
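The enabled-expression rule itself is simple to state in code. A minimal sketch, with signatures modeled as tuples of test expressions in evaluation order (the function name and shapes are assumptions of mine):

```python
def enabled_expressions(signatures, arguments, dispatched):
    """An expression may be tested next only if it is a plain argument
    to the generic function, or is the first not-yet-dispatched test in
    some signature -- so 'y/x>3' can't run before 'x!=0' has held."""
    enabled = set(arguments)
    for sig in signatures:
        remaining = [e for e in sig if e not in dispatched]
        if remaining:
            enabled.add(remaining[0])
    return enabled
```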
Once you've cut it down to enabled expressions, you have to determine which
index has the smallest average number of applicable rules for the branches
it would generate. The purpose of this is to rule out "bad" dispatch
expressions that don't narrow things down very much. If you have one
parameter that's a boolean value, and another parameter that's one of fifty
different strings, it's pretty clear that you should test the string
equality first, as it's going to cut down on the number of other tests you
have to do to find the applicable methods.
RuleDispatch does this by having indexes that are active and tracking from
the moment you add a rule to a generic function. I'm not sure I want to do
that for PEAK-Rules, because it slows down definition and there's a very
real possibility that you will never call the generic function even once.
More precisely, the issue is that RuleDispatch very eagerly computes a
complete dispatch index, taking into consideration all forms of
inter-criterial implication. The code to do this has to be very complex in
order to perform well, so I'd like to take a less-eager approach in
PEAK-Rules if possible. For example, it's possible that I could compute a
simpler statistic as a rough estimate of how "good" an index is for
dispatching, and then only compute the "real" statistic once it is needed.
Another possibility is that I could lazily generate code, in a manner
similar to the old PC game "Animal". In "Animal" the computer asks you a
series of yes-no questions to guess what animal you are thinking of. If it
doesn't know your animal, it asks you what question would distinguish the
animal it guessed from the animal you were thinking of, and thus it
"learns" to build a decision tree.
What I'm wondering is if I could do the same thing here. Initially just
generate a stub function that calls back to the engine saying "help, I
don't know what to do". The engine would examine the *current* arguments,
figure out what methods apply, and then expand the tree by adding the tests
that it had to use to determine which methods applied. Any branches for
the "paths not taken" would get callback stubs to repeat this process.
So, each time the function was called, it would "learn" what questions to
ask, and only the useful parts of the tree would be built. It would be
slower to answer things it had never answered before, of course, and a big
potential downside is that it might start taking longer and longer to
regenerate the bytecode. On the other hand, I think I maybe see a way to
redirect the jumps in an existing Code object so the entire tree wouldn't
be rebuilt, just incrementally expanded. The bytes would still have to be
copied, but it wouldn't be nearly so bad.
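As a dictionary-cache analogue of the stub-callback idea (in the real design the miss would trigger regeneration of the dispatch bytecode rather than a dict insert; the class and 'classify' callback are hypothetical):

```python
class LearningDispatcher:
    """'Animal'-style laziness: unknown argument-type combinations fall
    through to an expensive engine callback, and the answer is learned
    so later calls with the same type signature skip it entirely."""
    def __init__(self, classify):
        self.classify = classify   # engine callback: args -> method
        self.learned = {}          # type signature -> method
        self.misses = 0
    def __call__(self, *args):
        key = tuple(type(a) for a in args)
        method = self.learned.get(key)
        if method is None:
            self.misses += 1       # "help, I don't know what to do"
            method = self.learned[key] = self.classify(*args)
        return method(*args)
```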
On the surface, this approach sounds really cool, especially since it
closely resembles what Psyco does to generate type-specific versions of a
function. However, it doesn't leave as well-defined a place for
table-based dispatches to occur. What would happen is that you'd get long
runs of unbalanced subtrees that effectively become linear searches.
It's too bad Python bytecode doesn't have any kind of "computed jump"
feature, otherwise such tables could be built right into the
bytecode. However, Python *can* do a dictionary get() plus about 6 integer
'<' operations in less than a microsecond, meaning a binary search through
64 possibilities after looking up a target value in a dictionary is quite
competitive when compared to a lookup and a function call.
So we might not even need a special optimization for table-based dispatch,
since N-way dispatch could be implemented with binary trees. Since it
takes a *lot* of rules to produce even 64 dispatch-relevant values for a
single expression (e.g. 64 different values that different rules test
against the same argument expression), this should actually be very
competitive as a dispatch technique. A tree of 32 values or less should be
another .1 microseconds faster, and a tree of 128 values is only another .1
microseconds slower.
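That arithmetic can be sketched with the stdlib bisect module; the binary search below is a Python-level stand-in for what a chain of inline '<' comparisons would do in generated bytecode (make_switch is a hypothetical name):

```python
from bisect import bisect_left

def make_switch(cases, default):
    """N-way value dispatch without a computed jump: binary-search the
    sorted keys, falling back to a default handler for unknown values."""
    keys = sorted(cases)
    handlers = [cases[k] for k in keys]
    def dispatch(value):
        i = bisect_left(keys, value)       # ~6 comparisons for 64 keys
        if i < len(keys) and keys[i] == value:
            return handlers[i]
        return default
    return dispatch
```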
The tree can even be lazily expanded, at the cost of more memory overhead
to keep track of where a given subtree was "left off", but it wouldn't be
much different from the same overhead that RuleDispatch has for such
lazily-expanded subtrees now. Probably the sensible thing to do is to
expand a subtree if the subtree's total size is likely less than the memory
required to remember how to expand it. Which means that laziness is really
just a speed optimization that could be left to a later phase, after the
eager algorithm is stable.
So, the resulting algorithm at this point seems not much different from
RuleDispatch, except in the details. This is good from a risk point of
view, although it may be a bit tedious to actually implement.
The main parts remaining that I'd like to optimize have to do with the
laziness of index generation and especially the computation of the "best"
index for dispatching. RuleDispatch goes through a lot of convolutions to
optimize the indices so that they can quickly compute their "goodness" for
an arbitrary collection of active rules. However, I wonder if it wouldn't
be easier and/or simpler to just recompute indexes on each subtree
buildout, especially if the indexes' "goodness" measurements were conducted
using a sampling approach.
That is, if you have a lot of rules, it might be simpler to take a
fixed-size random sampling of rules, and use those to decide what to
dispatch on next. My limited understanding of statistics says that the
chance of a randomly-chosen sample being significantly different in value
distribution from the population as a whole is very low, but meanwhile it
would put a fixed-size cap on the maximum analysis time needed to choose a
dispatch expression for a subtree.
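A minimal sketch of that sampling estimate, assuming a 'branch_of' function mapping each rule to the branch it would land in (both names are mine), and seeding the RNG so repeated analyses stay deterministic:

```python
import random
from collections import Counter

def sampled_avg_branch_size(branch_of, rules, sample_size=50, seed=0):
    """Estimate an index's average branch size from a fixed-size random
    sample of the (non-empty) active rules, capping analysis time no
    matter how large the rule set grows."""
    rules = list(rules)
    if len(rules) > sample_size:
        # Seeded for deterministic, repeatable estimates
        rules = random.Random(seed).sample(rules, sample_size)
    sizes = Counter(branch_of(r) for r in rules)
    return sum(sizes.values()) / len(sizes)
```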
On the other hand, at least RuleDispatch's indexes are a known and
implemented quantity, with a host of optimizations and a fair amount of use
-- albeit a near-total lack of unit tests. But RuleDispatch really only
has two types of indexes, anyway: hierarchy/identity indexes and range
indexes. There are several actual kinds of criteria, but nearly all of
them end up boiling down to hierarchy indexes once you factor out what
expression they apply to and what the lookup algorithm is.
Sigh. Unfortunately, a big part of what's left here is just plain tedious
to implement correctly. Viewed in broad strokes, it's all very elegant but
the details are a pain. When you do things like "and" criteria and "not"
criteria applied to isinstance() tests, the table-driven approaches start
getting messy quickly. It might actually make more sense to render such
criteria as literal inline isinstance() tests in some cases - perhaps even
most cases.
Multiple inheritance is another tricky corner case, because you pretty much
*have* to recompute indexes when you encounter a multiply-inheriting class
(while doing type testing). The base engine in the core framework avoids
this problem by recomputing *everything* whenever it encounters a
combination of types it hasn't seen before. That seems impractical in this
instance, however.
I guess at this point I'm really not certain how "wide" dispatches (i.e.
ones with many possible values) should work. Although the binary search
thing is possible in principle, it could easily break down such that the
whole function would have to be recoded whenever it encountered a new
type. (Range/equality/identity indexes and single-parent hierarchy indexes
could avoid this problem, and it wouldn't actually occur in practice except
on the first encounter with each type whose len(__bases__) was >1.)
It really does seem as though wide type dispatches want to happen via some
kind of dictionary lookup, rather than via binary searching. The
disadvantage is that then there's function calling overhead, but it might
be worth it.
Ah well. I think what I'm going to have to do is just give up on all these
optimization attempts and live with the fact that if/then trees will tend
towards linear-search behavior as dispatch widths increase, and figure out
a cost model for determining how long a linear search has to be before it
costs less to do a function call, and only then make the switch. It's also
possible that such an optimization could stack more than one condition into
the lookup, the way the core engine just dispatches on a tuple of all the
types.
But whatever the case, I think it's pretty clear I'm not going to be able
to get to that optimization without first implementing a more "naive"
approach, so for now I'll focus on the basics. I'm probably overoptimizing
anyway, in that I'm still trying too hard here to create a single dispatch
algorithm that's efficient in all cases. But the only way to do that is by
giving up the best possible performance in more specific situations, in
favor of a "generally good" approach.
Somewhere around here I've got a writeup I did about my ideas of the
different major use cases for generic functions; I should probably dig that
up and make note of the tradeoffs involved in what algorithms are likely to
work better for each.