[PEAK] Pattern matching and function-expansion

Phillip J. Eby pje at telecommunity.com
Wed Jun 25 15:00:00 EDT 2008

Another brain dump, this one on implementing pattern matching in 
PEAK-Rules.  Last time, I wrote about using backquoted syntax in 
rules to do pattern matching of syntax trees, to easily support 
defining substitution rules for translating list comprehensions to 
Python+SQL.  This is a rough plan for implementing pattern matching 
and related meta-syntactic operations.

Patterns and Bind Variables

One of the issues left open last time was how to handle patterns 
embedded in an expression that might not be expanded.  That is, how 
should the parser treat a backquoted expression outside of the one 
right place where such a pattern would be used?

One possibility would be to treat backquotes as denoting an "AST 
constant" -- that is, parse the backquoted expression and return a 
constant of some sort.  But this would produce unexpected results if 
somebody mistakenly used backquotes for their primary meaning in 
Python 2.x (a synonym for repr()).  We could, of course, just reject 
patterns outside of their special place - the right hand side of an 
"in" operation.

That's unattractive, however, in two ways.  First, it means that 
we're building in special syntax handling for a feature that's only 
going to be used by generic functions that operate on Python syntax 
trees.  And second, it will likely end up duplicating work that could 
be used for other metasyntax operations, like inlined functions 
(which we'll probably need for query stuff).

Another issue with using backquotes to denote both bind variables 
*and* patterns is that there's a contextual ambiguity: is `x` a 
pattern matching a variable named "x", or is it the bind variable 
named "x"?  Without careful reading, it's not going to be clear, and 
mistakes are likely.

So, I think the solution is to:

1. Ban the use of backquotes in ordinary PEAK-Rules expressions

2. Allow metasyntactic functions (i.e., specially-defined functions) 
to use an alternate parser for their arguments.

Suppose, for example, that we define a special function called 
"syntax" that returns a special object for pattern matching, so that a 
rule like this:

     expr in syntax(type(`x`) is `y`)

would match `expr` if it's of the form "type(...) is ...".  Or, 
syntax() could accept two arguments, and be a boolean expression in 
its own right:

     syntax(expr, type(`x`) is `y`)

Either way, one of the above would replace the previous proposal, which was:

     expr in `type(`x`) is `y``

This would let us define the "syntax()" function in another module 
that you only import if you want to do syntax pattern matching.

The mechanism to implement this is actually pretty 
straightforward.  The default expression builder just needs to have a 
changed "Call()" method that checks if the thing being called (e.g. 
"syntax") is a compile-time constant that's registered as an 
"inlinable function".  If so, it invokes the compile-time version of 
the function, passing in the raw (untranslated) Python AST 
structures.  The compile-time code then would have the option of 
using a different builder to parse any of its arguments, and thus 
treat one of them as a match pattern instead of as a 
to-be-interpreted expression.
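A minimal sketch of that hook, assuming an illustrative "inlinable" 
registry, an "ExprBuilder" class, and a tuple-shaped AST (all of these 
names are stand-ins for illustration, not the actual PEAK-Rules API):

```python
# Hypothetical registry of "inlinable functions"; the names and the
# tuple-based AST here are illustrative stand-ins, not PEAK-Rules APIs.
INLINABLE = {}

def inlinable(func):
    """Register a compile-time expander for a special function."""
    def register(expander):
        INLINABLE[func] = expander
        return expander
    return register

class ExprBuilder:
    """Stand-in for the default expression builder."""
    def build(self, raw_ast):
        # normal translation of a raw Python AST to an expression node
        return ('expr', raw_ast)

    def Call(self, func, raw_args):
        # If the called object is a registered compile-time constant,
        # hand the *untranslated* argument ASTs to its expander.
        if func in INLINABLE:
            return INLINABLE[func](self, raw_args)
        return ('Call', func, [self.build(a) for a in raw_args])

def syntax(*args):
    raise TypeError("syntax() only has meaning inside a rule")

@inlinable(syntax)
def expand_syntax(builder, raw_args):
    # A real expander would re-parse raw_args with a pattern builder;
    # here we just tag them so the hook's effect is visible.
    return ('SyntaxPattern', raw_args)
```

With this in place, Call() on "syntax" hands off to the expander, 
while Call() on any ordinary function builds a normal call node.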

In the case of the syntax() builder, it would accept backquotes, but 
restrict them to being used around a single "NAME" token, to indicate 
a bind variable.
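That restriction could be enforced with a simple check on the 
backquoted text; here's a sketch (the function name is hypothetical):

```python
import re

def bind_name(backquoted):
    """Accept a backquoted fragment only if it is a single NAME token,
    i.e. a bind variable; anything more complex is a syntax error."""
    text = backquoted.strip()
    if re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*', text) is None:
        raise SyntaxError("backquotes may enclose only a single name")
    return text
```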

The Matching Algorithm

There are a couple of different ways to handle the syntax matching 
process.  Since the syntax() special function will be using its own 
builder, one possibility is to just build the match algorithm right 
in.  This has the advantage of speed in parsing, but is at something 
of a disadvantage for testing and simplicity.

The other alternative is to perform the matching on an already-built 
AST tree.  This has the advantage of making the builder much simpler: 
it can simply add a backquote operation to the default expression 
builder.  The match code itself might also be simpler, because 
most AST node types should be matchable using the same basic rule, by 
recursively matching each sub-node of the node.  The code can also be 
tested in a more isolated way, since it's only PEAK-Rules AST nodes 
that will be used in the tests, rather than needing to also use the 
Python AST structures.

So, here's a basic spec for AST node matching.  There will be a 
recursive generic function:

    match_predicate(pattern, expr, binds) -> predicate

where "pattern" is an AST describing the pattern to match, "expr" is 
the AST being matched, "binds" is a dict mapping bind-variable names 
to expression ASTs, and "predicate" is a PEAK-Rules predicate 
describing the conditions that must be true for "expr" to match "pattern".
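The dispatch-by-pattern-type shape can be sketched with 
functools.singledispatch standing in for a PEAK-Rules generic 
function, and tuple-shaped predicates standing in for real predicate 
objects (both are illustrative, not the actual machinery):

```python
from functools import singledispatch

# Skeleton only: singledispatch and tuple-shaped predicates stand in
# for PEAK-Rules' own generic-function and predicate machinery.

@singledispatch
def match_predicate(pattern, expr, binds):
    # default rule: treat the pattern as a constant to compare against
    return ('eq', expr, pattern)

# Rules for specific pattern types hook in by registering on the
# function; e.g. a rule for tuple-shaped nodes that recurses into
# each sub-node:
@match_predicate.register(tuple)
def match_node(pattern, expr, binds):
    subs = [match_predicate(item, ('getitem', expr, pos), binds)
            for pos, item in enumerate(pattern)]
    return ('and', subs)
```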

Rules defined for this function will determine what to do based on 
the type of "pattern".  If "pattern" is a bind variable, the "binds" 
dict is updated in-place, inserting "expr" under the bind variable's 
name.  If there's already an entry there, this predicate is returned, 
performing an equality comparison between the new binding and the old 
binding of the variable:

     Test(Truth(Compare(expr, (('==', old_binding),))), True)

(This is so that patterns like "`x` is not `x`" will actually compare 
the two "x"s and see if they're equal.)

Otherwise, match_predicate() on a bind variable returns "True" as its 
predicate, meaning that it always matches.

Hm...  actually, we should probably have a special exception for a 
bind variable named "_", so that you don't have to make up names for 
bindings you don't care about.  That way, you can use it as a "don't 
care" pattern.  So, the match rule for Bind() should not define a 
binding or return a non-True predicate if the bind variable name is "_".
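The bind-variable rules above, including the "_" wildcard and the 
repeated-binding equality test, can be sketched like so (the Bind 
class and tuple predicates are illustrative stand-ins):

```python
class Bind:
    """Illustrative stand-in for a bind-variable AST node."""
    def __init__(self, name):
        self.name = name

def match_bind(pattern, expr, binds):
    """Sketch of match_predicate()'s rule for Bind nodes."""
    if pattern.name == '_':
        return True                      # "don't care": always matches
    if pattern.name in binds:
        # already bound: test the new occurrence against the old binding
        return ('eq', expr, binds[pattern.name])
    binds[pattern.name] = expr           # first occurrence: record it
    return True
```

So in a pattern like "`x` is not `x`", the first `x` just records a 
binding, and the second produces the equality test.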

Matching Structures and AST nodes

For most node types other than Bind(), the predicates are a bit more 
complex.  By default, the predicate should be an exact (istype) match 
of the node type, intersected with a recursive application of 
match_predicate() to each of the subnodes of the node.  Something like:

    predicate = Test(expr, IsInstance(istype(type(pattern))))
    for pos, item in enumerate(pattern):
        if pos:
            predicate = intersect(
                predicate,
                match_predicate(item, Getitem(expr, Const(pos)), binds)
            )
    return predicate

This then matches each subpattern against a Getitem() expression on the node.

A near-identical approach can be used for matching sequences of fixed 
size, except replacing the istype() test with a len() equality 
test.  This would work okay for most node types that have sequence 
fields (such as Compare), but might be more complex for others.
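A sketch of that fixed-size sequence rule, with a length-equality 
head test in place of istype() (tuple predicates are again 
illustrative, and the per-item equality test stands in for a 
recursive match_predicate() call):

```python
def match_sequence(pattern, expr, binds):
    """Match a fixed-size sequence: length test plus per-item matches."""
    preds = [('len_eq', expr, len(pattern))]
    for pos, item in enumerate(pattern):
        # stand-in for a recursive match_predicate() call per item
        preds.append(('eq', ('getitem', expr, pos), item))
    return ('and', preds)
```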

Const nodes would compare equality of contents (along with an istype 
test for Const-ness), while strings would directly compare 
equality.  "None" would use an "is None" test.

These rules should be sufficient to match most syntax, with the 
caveat that Dict nodes would have to define their key-value pairs in 
exactly the same order as the target.  (However, it's unlikely that 
we'll want or need much matching of dictionary displays, so no big deal.)

Similarly, Call nodes will only match if the exact same keyword 
argument order is used, and it won't be possible to interchangeably 
match different ways of specifying the same call (using either 
positional or keyword or * or ** for the same argument(s)).

For most applications, this won't be an issue, but it does mean that 
"inlinable functions" will have to be expanded in some other way than 
using syntax patterns.  (Which is okay, because such functions need 
builder-level access to the AST anyway.)

Matching Variable-length Sequences, Dictionaries, etc.

If at some point we want to support variable-length sequence 
matching, then we'll need a slightly more sophisticated approach to 
sequence matching, using a range of length matches, and allowing 
fixed-position tests at the beginning and/or end of the 
sequence.  There will need to be a way to get a minimum and maximum 
match length for each part of the sequence in the pattern, and thus 
determine the overall size range of the sequence.  It would then also 
be possible to use positive offsets from the first of the list up 
until the first variable-size subpattern, and negative offsets from 
the end of the list back to the last variable-size subpattern.  It 
would probably not be practical to allow multiple variable-size 
subpatterns, however, nor any fixed-size subpatterns between them.
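Combining per-subpattern (min, max) lengths into an overall size 
range might look like this sketch, where None marks an unbounded 
maximum (e.g. from a variable-size "rest" subpattern):

```python
def size_range(subpattern_ranges):
    """Combine per-subpattern (min, max) length pairs into the overall
    (min, max) size range of the sequence; None means unbounded."""
    lo = sum(mn for mn, mx in subpattern_ranges)
    if any(mx is None for mn, mx in subpattern_ranges):
        return (lo, None)
    return (lo, sum(mx for mn, mx in subpattern_ranges))
```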

For dictionary matching, some of the same issues apply, in that to do 
partial matching of a dictionary, one would need a way to specify a 
wildcard key/value pair.  Otherwise, dictionaries would have to be 
matched by key...  and all the keys would need to be constants, 
rather than patterns themselves.

Fortunately, these are only issues if we need to do more 
sophisticated pattern matching at a future date.  It's not clear at 
the moment that we will need them for the kind of syntax matching 
used to do query translation.

Inline Function Expansion

In query processing, we will sometimes want to define functions or 
methods that must be expanded to access SQL-level tables or functions 
for optimal performance.  For example, there might be methods or 
properties (or even class-level collections) that perform query 
restriction based on "active objects only", or "allowed objects 
only", to name just a few possibilities.

Such functions or methods need to be effectively "inlined" into the 
query, to prevent repeated SQL queries and avoid unnecessary loops or 
joins in the Python code.

So, if we make it possible to register "special function" objects 
with PEAK-Rules, and have a simple way to define their expansion, we 
can manage the expansion via PEAK-Rules.
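A sketch of such a registry, with hypothetical names ("expands", 
"active_only") and a tuple-shaped query AST; nothing here is an 
actual PEAK-Rules API:

```python
# Hypothetical registry mapping special functions to their query-time
# expansions; names and AST shapes are illustrative only.
EXPANSIONS = {}

def expands(func):
    """Register an expander that inlines `func` into a query AST."""
    def register(expander):
        EXPANSIONS[func] = expander
        return expander
    return register

def active_only(objs):
    # the ordinary runtime definition, used outside of queries
    return [o for o in objs if o.active]

@expands(active_only)
def expand_active_only(source_ast):
    # inlined form: a filter condition pushed into the query itself
    return ('filter', source_ast, ('eq', ('attr', 'active'), True))
```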

Of course, one implication of this is that the definition of a 
special function can't change once used, or any query that was 
compiled based on it will be incorrect.  This is probably not a big 
deal, but a related implication is that expanded *methods* can't be 
polymorphic.  That is, you can't override a query method in a 
subclass, since it will then have more than one definition.

(Now, in principle, one could expand a special method in such a way 
that it reduced to a series of "or" or "if/else" conditions between 
the various subclass definitions, such that the resulting query would 
use outer joins to subclass tables, and a bunch of "or" 
conditions.  But I think that's probably a bridge we should wait to 
cross until we actually get to it.)

We may need a way to distinguish functions/methods/properties that 
return *conditions* from those that define collections or 
values.  But it may be that we can get away with these functions' 
compile-time model being simply a straightforward 
macro-expansion.  There may be some complexities introduced by the 
need to handle methods and properties, but that won't show up until 
we're defining a formal O-R mapping model.  (The initial prototype 
will use hand-coded rules (based on syntax matching) rather than 
being a metaprogrammed O-R mapping.)

Open Issues/Todo

* Accessing bind variables from method bodies

* Prototyping query translation

* Defining a framework for the O-R mapping process

More information about the PEAK mailing list