[PEAK] Pattern matching and function-expansion
Phillip J. Eby
pje at telecommunity.com
Wed Jun 25 15:00:00 EDT 2008
Another brain dump, this one on implementing pattern matching in
PEAK-Rules. Last time, I wrote about using backquoted syntax in
rules to do pattern matching of syntax trees, to easily support
defining substitution rules for translating list comprehensions to
Python+SQL. This is a rough plan for implementing pattern matching
and related meta-syntactic operations.
Patterns and Bind Variables
---------------------------
One of the issues left open last time was how to handle patterns
embedded in an expression that might not be expanded. That is, how
should the parser treat a backquoted expression outside of the one
right place where such a pattern would be used?
One possibility would be to treat backquotes as denoting an "AST
constant" -- that is, parse the backquoted expression and return a
constant of some sort. But this would produce unexpected results if
somebody mistakenly used backquotes for their primary meaning in
Python 2.x (a synonym for repr()). We could, of course, just reject
patterns outside of their special place - the right hand side of an
"in" operation.
That's unattractive, however, in two ways. First, it means that
we're building in special syntax handling for a feature that's only
going to be used by generic functions that operate on Python syntax
trees. And second, it will likely end up duplicating work that could
be used for other metasyntax operations, like inlined functions
(which we'll probably need for query stuff).
Another issue with using backquotes to denote both bind variables
*and* patterns is that there's a contextual ambiguity: is `x` a
pattern matching a variable named "x", or is it the bind variable
named "x"? Without careful reading, it's not going to be clear, and
mistakes are likely.
So, I think the solution is to:
1. Ban the use of backquotes in ordinary PEAK-Rules expressions
2. Allow metasyntactic functions (i.e., specially-defined functions)
to use an alternate parser for their arguments.
Suppose, for example, that we defined a special function called
"syntax" that returns a special object for pattern matching, so that a
rule like this:
expr in syntax(type(`x`) is `y`)
would match `expr` if it's of the form "type(...) is ...". Or,
syntax() could accept two arguments, and be a boolean expression in
its own right:
syntax(expr, type(`x`) is `y`)
Either way, one of the above would replace the previous proposal, which was:
expr in `type(`x`) is `y``
This would let us define the "syntax()" function in another module
that you only import if you want to do syntax pattern matching.
The mechanism to implement this is actually pretty
straightforward. The default expression builder just needs a modified
"Call()" method that checks whether the thing being called (e.g.
"syntax") is a compile-time constant that's registered as an
"inlinable function". If so, it invokes the compile-time version of
the function, passing in the raw (untranslated) Python AST
structures. The compile-time code would then have the option of
using a different builder to parse any of its arguments, and thus
treat one of them as a match pattern instead of as a
to-be-interpreted expression.
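In rough sketch form, the idea looks something like this (register_inline,
PatternAwareBuilder, and the method names here are just placeholders for
illustration, not the actual builder API):

inline_functions = {}   # compile-time constant -> compile-time expander

def register_inline(func, expander):
    """Arrange for `expander` to run at rule-compile time in place of `func`."""
    inline_functions[func] = expander

class PatternAwareBuilder:
    # placeholder builder; only the Call() hook matters for this sketch

    def constant_value(self, func_ast):
        """Return the compile-time constant the callee refers to, or None."""
        ...

    def ordinary_call(self, func_ast, arg_asts):
        """Build a normal, to-be-interpreted call expression."""
        ...

    def Call(self, func_ast, arg_asts):
        target = self.constant_value(func_ast)
        expander = inline_functions.get(target)
        if expander is not None:
            # Hand the raw, untranslated argument ASTs to the expander,
            # which may re-parse some of them with a different builder
            # (e.g. treating one argument as a match pattern).
            return expander(self, arg_asts)
        return self.ordinary_call(func_ast, arg_asts)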
In the case of the syntax() builder, it would accept backquotes, but
restrict them to being used around a single "NAME" token, to indicate
a bind variable.
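A minimal sketch of that restriction, with Bind as a made-up placeholder
node for bind variables:

class Bind:
    """Placeholder pattern node for a bind variable like `x`."""
    def __init__(self, name):
        self.name = name

def backquote_to_bind(names):
    # The syntax() builder only accepts a single NAME inside backquotes
    if len(names) == 1:
        return Bind(names[0])
    raise SyntaxError("a backquoted pattern may only contain a single name")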
The Matching Algorithm
----------------------
There are a couple of different ways to handle the syntax matching
process. Since the syntax() special function will be using its own
builder, one possibility is to just build the match algorithm right
in. This has the advantage of speed in parsing, but is at something
of a disadvantage for testing and simplicity.
The alternative is to perform the matching on an already-built
AST tree. This has a few advantages. First, the builder becomes much
simpler: it just adds a backquote operation to the default expression
builder. Second, the match code itself might be simpler, because most
AST node types should be matchable using the same basic rule:
recursively matching each sub-node of the node. Finally, the code can
be tested in a more isolated way, since the tests will only use
PEAK-Rules AST nodes, rather than also needing the Python AST
structures.
So, here's a basic spec for AST node matching. There will be a
recursive generic function:
match_predicate(pattern, expr, binds) -> predicate
where "pattern" is an AST describing the pattern to match, "expr" is
the AST being matched, "binds" is a dict mapping bind-variable names
to expression ASTs, and "predicate" is a PEAK-Rules predicate
describing the conditions that must be true for "expr" to match "pattern".
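As a rough sketch, the declaration might look like this, using PEAK-Rules'
own @abstract() decorator (the module layout and docstring are just for
illustration):

from peak.rules import abstract

@abstract()
def match_predicate(pattern, expr, binds):
    """Return a predicate that is true when `expr` matches `pattern`.

    `binds` maps bind-variable names to expression ASTs, and is updated
    in place as bind variables are encountered in the pattern.
    """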
Rules defined for this function will determine what to do based on
the type of "pattern". If "pattern" is a bind variable, the "binds"
dict is updated in-place, inserting "expr" under the bind variable's
name. If there's already an entry there, this predicate is returned,
performing an equality comparison between the new binding and the old
binding of the variable:
Test(Truth(Compare(expr, (('==', old_binding),))), True)
(This is so that patterns like "`x` is not `x`" will actually compare
the two "x"s and see if they're equal.)
Otherwise, match_predicate() on a bind variable returns "True" as its
predicate, meaning that it always matches.
Hm... actually, we should probably have a special exception for a
bind variable named "_", so that you don't have to make up names for
bindings you don't care about. That way, you can use it as a "don't
care" pattern. So, the match rule for Bind() should not define a
binding or return a non-True predicate if the bind variable name is "_".
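Put together, the Bind() rule might look roughly like this (reusing the
made-up Bind node from earlier and the Test/Truth/Compare criteria shown
above; the registration line is only indicative):

def match_bind(pattern, expr, binds):
    # would be registered for Bind patterns, e.g. something like:
    #   when(match_predicate, "isinstance(pattern, Bind)")(match_bind)
    if pattern.name == '_':
        return True                     # "don't care": always matches, never binds
    if pattern.name in binds:
        # already bound: require the new expression to equal the old one,
        # so that "`x` is not `x`" really compares the two x's
        old_binding = binds[pattern.name]
        return Test(Truth(Compare(expr, (('==', old_binding),))), True)
    binds[pattern.name] = expr          # first occurrence: just record it
    return True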
Matching Structures and AST nodes
---------------------------------
For most node types other than Bind(), the predicates are a bit more
complex. By default, the predicate should be an exact (istype) match
of the node type, intersected with a recursive application of
match_predicate() to each of the subnodes of the node. Something like:
# exact node-type match, intersected with a recursive match of each subnode
predicate = Test(expr, IsInstance(istype(type(pattern))))
for pos, item in enumerate(pattern):
    if pos:    # subnodes start at position 1
        predicate = intersect(
            predicate,
            match_predicate(item, Getitem(expr, Const(pos)), binds)
        )
return predicate
This then matches each subpattern against a Getitem() expression on the node.
A near-identical approach can be used for matching sequences of fixed
size, except replacing the istype() test with a len() equality
test. This would work okay for most node types that have sequence
fields (such as Compare), but might be more complex for others.
Const nodes would compare equality of contents (along with an istype
test for Const-ness), while strings would directly compare
equality. "None" would use an "is None" test.
These rules should be sufficient to match most syntax, with the
caveat that Dict nodes would have to define their key-value pairs in
exactly the same order as the target. (However, it's unlikely that
we'll want or need much matching of dictionary displays, so no big deal.)
Similarly, Call nodes will only match if the exact same keyword
argument order is used, and it won't be possible to interchangeably
match different ways of specifying the same call (using either
positional or keyword or * or ** for the same argument(s)).
For most applications, this won't be an issue, but it does mean that
"inlinable functions" will have to be expanded in some other way than
using syntax patterns. (Which is okay, because such functions need
builder-level access to the AST anyway.)
Matching Variable-length Sequences, Dictionaries, etc.
------------------------------------------------------
If at some point we want to support variable-length sequence
matching, then we'll need a slightly more sophisticated approach to
sequence matching, using a range of length matches, and allowing
fixed-position tests at the beginning and/or end of the
sequence. There will need to be a way to get a minimum and maximum
match length for each part of the sequence in the pattern, and thus
determine the overall size range of the sequence. It would then also
be possible to use positive offsets from the front of the list up
to the first variable-size subpattern, and negative offsets from
the end of the list back to the last variable-size subpattern. It
would probably not be practical to allow multiple variable-size
subpatterns, however, nor any fixed-size subpatterns between them.
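If we ever do get there, the bookkeeping might be sketched like this, with
Star as an invented marker for a variable-size subpattern:

class Star:
    """Invented marker for a variable-size subpattern."""

def split_at_star(items):
    """Split a pattern sequence into (head, star, tail) around one Star."""
    stars = [i for i, p in enumerate(items) if isinstance(p, Star)]
    if not stars:
        return items, None, ()
    if len(stars) > 1:
        raise ValueError("at most one variable-size subpattern is supported")
    i = stars[0]
    return items[:i], items[i], items[i + 1:]

# Head items would then be matched at positive offsets 0..len(head)-1, tail
# items at negative offsets -len(tail)..-1, with a minimum-length test of
# len(head) + len(tail) on the candidate sequence.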
For dictionary matching, some of the same issues apply, in that to do
partial matching of a dictionary, one would need a way to specify a
wildcard key/value pair. Otherwise, dictionaries would have to be
matched by key... and all the keys would need to be constants,
rather than patterns themselves.
Fortunately, these are only issues if we need to do more
sophisticated pattern matching at a future date. It's not clear at
the moment that we will need them for the kind of syntax matching
used to do query translation.
Inline Function Expansion
-------------------------
In query processing, we will sometimes want to define functions or
methods that must be expanded to access SQL-level tables or functions
for optimal performance. For example, there might be methods or
properties (or even class-level collections) that perform query
restriction based on "active objects only", or "allowed objects
only", to name just a few possibilities.
Such functions or methods need to be effectively "inlined" into the
query, to prevent repeated SQL queries and unnecessary loops or
joins in the Python code.
So, if we make it possible to register "special function" objects
with PEAK-Rules, and have a simple way to define their expansion, we
can manage the expansion via PEAK-Rules.
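Building on the made-up register_inline() registry sketched earlier,
registering an application-level function might look something like this
(the is_active example and the builder methods used in the expander are
invented purely for illustration):

def is_active(obj):
    """Ordinary Python fallback, used outside of compiled queries."""
    return obj.status == 'active'

def expand_is_active(builder, arg_asts):
    # At query-compile time, rewrite is_active(X) into the condition
    # "X.status == 'active'", which the SQL translator can handle,
    # instead of calling is_active() once per row in Python.
    obj = builder.build(arg_asts[0])
    return builder.compare(builder.getattr(obj, 'status'),
                           '==', builder.const('active'))

register_inline(is_active, expand_is_active)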
Of course, one implication of this is that the definition of a
special function can't change once used, or any query that was
compiled based on it will be incorrect. This is probably not a big
deal, but a related implication is that expanded *methods* can't be
polymorphic. That is, you can't override a query method in a
subclass, since it will then have more than one definition.
(Now, in principle, one could expand a special method in such a way
that it reduced to a series of "or" or "if/else" conditions between
the various subclass definitions, such that the resulting query would
use outer joins to subclass tables, and a bunch of "or"
conditions. But I think that's probably a bridge we should wait to
cross until we actually get to it.)
We may need a way to distinguish functions/methods/properties that
return *conditions* from those that define collections or
values. But it may be that we can get away with these functions'
compile-time model being simply a straightforward
macro-expansion. There may be some complexities introduced by the
need to handle methods and properties, but those won't show up until
we're defining a formal O-R mapping model. (The initial prototype
will use hand-coded rules (based on syntax matching) rather than
being a metaprogrammed O-R mapping.)
Open Issues/Todo
----------------
* Accessing bind variables from method bodies
* Prototyping query translation
* Defining a framework for the O-R mapping process