[PEAK] Transforming comprehensions to queries

Thu Jun 19 13:14:13 EDT 2008

So, this is just a rambling post on my current thoughts regarding the 
design of a system for taking object queries expressed as list 
comprehensions (or generator expressions), and using them to either 
find objects in memory, or perform corresponding SQL queries.

It has also been through a few rewrites and rounds of editing, which, 
aside from making it look easier to write than it actually was, may 
also have rendered it less comprehensible in the process.  (Because I 
might have dropped an important piece of reasoning from the place 
where I started this to where it ended up.)  So, if you have any 
questions, feel free to holler back.

Parsing Listcomps into Queries
------------------------------

I've recently added code to PEAK-Rules' "ast_builder" module to 
support parsing list comprehensions and generator expressions, so I 
now have a pretty good idea of what's involved in specifying 
same.  PEAK-Rules' code generator (combined with BytecodeAssembler) 
has a number of data structures defined that allow it to specify just 
about any Python expression as a simple tree, like:

    Add(Local('a'), Local('b'))

to represent the expression 'a+b'.  These objects also support 
generating the corresponding Python bytecode, so that function 
objects can be automatically generated to execute the expression.

So, the basic idea of creating queries from this stuff is that you 
write some expression like:

     [(e.when, e.title) for e in Event if e.date==today]

And, depending on whether you're using in-memory data or an external 
database, you would get a different result.

In the case of an in-memory database, for example, the expression 
might be executed exactly as-is -- a straightforward translation from 
expression to bytecode.

For an SQL database, however, the translation of the above query 
might look something like:

     list(query("select when, title from event e where e.date=%s", 
todate(today)))

That is, elements of the list comprehension have been "lifted" and 
converted to SQL, then replaced in the expression.  A more complex 
example.  Let's say that your query looks like this:

     [(e.when, e.title) for e in Event if some_function(e.date)==today]

Where "some_function" is an arbitrary Python function with no known 
translation to SQL.  Then the SQL-ized bytecode translation might 
effectively look like this:

     [(row[0], row[1]) for row in query("select when,title,date from 
event") if some_function(row[2])==today]

It's far from optimal, of course, but that's the price you pay.  You 
should, of course, be able to easily define optimization rules that 
handle this scenario for a given mapping, as long as an SQL 
translation is possible.

For example, if you know you will be using some_function() on an 
event's date field a lot, you could define your db schema such that 
you store the value in a separate (and possibly-indexed) 
field.  Then, you could write a rule that matches the pattern 
'some_function(x.date)' where 'x' is a variable of type 'Event', and 
rewrite the sub-expression accordingly.

Pattern Matching Rewrites
-------------------------

PEAK-Rules already does rewrites like this internally.  For example, 
here's a rule from peak.rules.predicates that matches expressions of 
the form 'type(x) is y', in order to turn them into type/class matches:

when(...,
     "expr in Call and expr.func in Const and (expr.func.value is type)"
     " and len(expr.args)==1"
)

Breaking this down, it matches when "the expression is a Call() node 
and the expression being called is a constant, and the value of that 
constant is the 'type' builtin, and the call has a single positional 
argument".  Ugly, but it does the job.

I've been thinking, however, that it would be very useful to be able 
to instead specify the above expression this way:

    "expr in `type(`x`)`"

The idea here is that backquotes around a name would denote a "bind 
variable", and backquotes around anything else would constitute a 
"match pattern".  If the match pattern is successfully matched, then 
the "bind variables" in it are set to equal the corresponding parts 
of the matched expression.

So, in "expr in `type(`x`)` and x in Const" would expand to adding 
"and expr.args[0] in Const" to the earlier example, giving us this:

     "expr in Call and expr.func in Const and (expr.func.value is type)"
     " and len(expr.args)==1 and expr.args[0] in Const"

But of course, "expr in `type(`x`)` and x in Const" is a lot easier 
to type.  :)  Such a shorthand will probably be essential to the 
implementation of a query rewrite system, just because manually 
translating these things is tedious and error-prone.

In fact, just writing this has shown me that the rewrite rules in 
PEAK-Rules are actually incorrect, in that the rule I quoted above 
will also match 'type(x, y=42)', and the body of the rule in question 
will then silently throw away the y=42 part: clearly an error.

But an automatic translation of a pattern match wouldn't make that 
mistake -- once a correct rule is defined for matching Call() 
patterns, for example, then all subsequent Call() translations will 
be done correctly.

So, this suggests that adding a pattern match facility to PEAK-Rules 
should be done before serious work on the query translator begins, 
and so I'll need to write a separate post about the details of that 
implementation.  I've already planned to add such a feature to 
PEAK-Rules, so it's not really a new idea.  It just needs to get 
implemented, basically.

Rewriting Queries
-----------------

Okay, so how do we rewrite a query into an SQL expression plus 
Python?  In the simplest case, the rewrite of a query is itself; in 
other words, no rewrite at all.  In the extreme case, the rewrite is 
a complete replacement of the expression with code that does a 
database query.  In between, there is some mixture of the two.

A general rewrite strategy would begin by replacing *collections* in 
the query, moving from left to right through all the 'in' and 'if' 
clauses, and binding the expressions in the 'for' clauses so that 
expressions further to the right can be interpreted 
correctly.  (Collections in the output expression would be replaced 
last, since even though the output is the leftmost expression, it's 
evaluated as if it were the rightmost.)

The simplest collections to spot and replace are constants referring 
to classes (or potentially, to predefined query objects).  These can 
simply be replaced with subexpressions representing the appropriate 
tables or queries in the database.

Collections, however, that are attributes of other objects, are more 
complex.  These attributes can only be spotted when the expression's 
underlying database type is known.  In other words, we only know that 
'e.date' refers to the event.date field in the database, if we know 
that 'e' is an event.  And we only know that 'e' is an event, if 'e' 
was defined as a 'for' variable against an 'in' whose type is also 
known to us.  Otherwise, we will have to leave that particular 
subexpression to be executed in Python.

In any case, that's why we'll need to work from left to right, so 
that as 'for' variables are encountered, we can keep track of their 
corresponding type.  This also implies that we will need a generic 
function that takes an arbitrary expression tree, and returns its 
database type.  By default, it will return "unknown", but it would be 
able to handle local variables by looking up the type assigned, and 
it could handle Getattr() nodes by recursively identifying the type 
of the target, then looking up the mapping for the named attribute.

We can then collapse any *consecutive* "for/in" clauses where we have 
a known database type for the variable into a *single* for/in clause 
that loops over a database query, yielding tuples containing each of 
the subexpressions needed by subsequent 'if' or 'in' clauses, or 
needed by the output expression.

Subexpression Lifting
---------------------

Within each 'if' or 'in' clause, we must replace Python expressions 
with database expressions wherever possible.  Any subexpression whose 
component parts are sourced from the database, and whose operator can 
be performed by the database, should be replaced with a database 
expression.  In the case of 'if' clauses, any top-level conditional 
expressions that can be expressed in the database should be removed 
entirely, and replaced by adding conditions to the corresponding 
'in'-clause query.  (TODO: what about OR-ed conditions in an 'if' clause?)

In the case of db data in 'in' clauses and the output expression, or 
in the case of 'if' conditions that can't be expressed in the db, but 
nonetheless are based on db values, the database-backed expression 
must be replaced with a reference to a field of the rows returned by 
the database query, and the query must be defined so as to include 
that field in its output.

These output (SELECT) fields should be added to the query in order of 
appearance in the output expression, so that in the (hopefully 
common) case where the query can be expressed entirely in SQL, the 
entire query can be replaced by a simple call to an SQL query, 
without any loop code in Python.

So, let's say we have a generic function, lift(), that takes an 
expression and replaces as many component parts as possible with 
db-backed expressions, returning a replacement Python expression, and 
a mapping of the db expressions used by that Python expression.

Thus "lift(a.b)" (actually "lift(Getattr(Local('a'), 'b'))") would 
return one of the following:

      Getattr(Local('a'), 'b')), {}
      Getattr(Local('$1'), 'b'), {'$1':field(sometype, a_attr)}
      Local('$1'), {'$1': field(sometype, b_attr)}

depending on whether 'a' and '.b' were known types/attributes.  In 
the first case, 'a' is a variable of unknown type, so the expression 
is left as-is, and no new information needs to be added to the query.

In the second case, the type of 'a' is known, but the 'b' attribute 
is not db-backed, and thus the '.b' operation must occur in 
Python.  However, since 'a' is known, we can insert a marker object 
to stand in its place, and add the marker's definition to our output mapping.

In the third case, the 'b' attribute is in the db, so we can just 
query it directly and assign it to '$1' (a generated local variable name).

Once we've lifted all the db-backed subexpressions from every clause 
of the query, we can use the map of generated local variables ('$1', 
'$2', etc.) to write the SELECT clause(s) of our queries, and to 
write the effective 'for' clause of our replacement.  For example:

    $1 for $1, $2 in query(...) if $2.foo==2

Whew.  Complex, but so far not too complicated.  The rewrites all 
take place at the AST layer, so the result can then just get turned 
into a function using BytecodeAssembler.

Query Parameters and Pre-compiling
----------------------------------

For performance reasons, we're not going to want to do all this stuff 
every time a query is executed, which means that we're going to want 
to compile most queries once and then forget about them.  In fact, we 
want to NOT compile most queries, but compile them on first execution 
and then save the compilation -- in much the same way that PEAK-Rules 
does for generic functions.

That means that ideally, we'll want to define queries in such a way 
that they can accept *parameters*...  which means functions, and 
therefore decorators.  I can see using something like:

     @db.defquery("[a.b for a in A if a.c==d]")
     def b_for_a_by_c(d):
         """Return a.b where a.c==d"""

This would make b_for_a_by_c into a special function that compiles 
its query on first call, then replaces its own code with the compiled 
query, so that subsequent calls go right on through.  It should also 
be possible to use this to define attributes and methods within a 
class, such that 'self' is treated as a parameter of known type in 
the query expression.

Further, if these query functions or properties exposed metadata 
about their type and definition, then it would be possible to use 
them in other query expressions... so long as they were not 
overridden in subclasses, that is.

It would be nice if there were a way to specify the types of any 
non-'self' arguments to a query function; perhaps keyword arguments 
to 'defquery' (or whatever it ends up being called) would do the 
trick.  This might not be necessary, since the type can often be 
inferred (or ignored) based on usage.  But if the parameter is used 
in an 'in' clause, or is a collection of some kind, knowing more 
would be helpful.

Open Issues
-----------

* OR-ed subexpressions in 'if' clauses

* collection-wide operations, e.g.  x.y == z.y where .y is a 
collection attribute

* aggregate expressions, like sum() in an output expression

* Mapping from field values to objects, and vice versa

* Subclasses overriding property/method query defs

* Parameterized queries with optional conditions -- e.g. "[x for x in 
Foo if x.bar==param or param is None]" -- we probably would like to 
generate code that lifts the 'param is None' test outside the loop, 
thus producing  [x for x in Foo] if param is None else [x for x in 
Foo if x.bar==param].  This probably needs to be looked at in 
connection with the general case of OR-ed subexpressions, but it 
could also apply to any loop-invariant parts of any subexpression - 
i.e., loop-invariant expressions should be lifted out of the loop.

Anyway, I'll try to nail these loose ends down a bit more in future posts.