[PEAK] Transforming comprehensions to queries
Phillip J. Eby
pje at telecommunity.com
Thu Jun 19 13:14:13 EDT 2008
So, this is just a rambling post on my current thoughts regarding the
design of a system for taking object queries expressed as list
comprehensions (or generator expressions), and using them to either
find objects in memory, or perform corresponding SQL queries.
It has also been through a few rewrites and rounds of editing, which,
aside from making it look easier to write than it actually was, may
also have rendered it less comprehensible in the process. (Because I
might have dropped an important piece of reasoning from the place
where I started this to where it ended up.) So, if you have any
questions, feel free to holler back.
Parsing Listcomps into Queries
------------------------------
I've recently added code to PEAK-Rules' "ast_builder" module to
support parsing list comprehensions and generator expressions, so I
now have a pretty good idea of what's involved in specifying
same. PEAK-Rules' code generator (combined with BytecodeAssembler)
has a number of data structures defined that allow it to specify just
about any Python expression as a simple tree, like:
Add(Local('a'), Local('b'))
to represent the expression 'a+b'. These objects also support
generating the corresponding Python bytecode, so that function
objects can be automatically generated to execute the expression.
So, the basic idea of creating queries from this stuff is that you
write some expression like:
[(e.when, e.title) for e in Event if e.date==today]
And, depending on whether you're using in-memory data or an external
database, you would get a different result.
In the case of an in-memory database, for example, the expression
might be executed exactly as-is -- a straightforward translation from
expression to bytecode.
For an SQL database, however, the translation of the above query
might look something like:
list(query("select when, title from event e where e.date=%s",
todate(today)))
That is, elements of the list comprehension have been "lifted" and
converted to SQL, then replaced in the expression. A more complex
example. Let's say that your query looks like this:
[(e.when, e.title) for e in Event if some_function(e.date)==today]
Where "some_function" is an arbitrary Python function with no known
translation to SQL. Then the SQL-ized bytecode translation might
effectively look like this:
[(row[0], row[1]) for row in query("select when,title,date from
event") if some_function(row[2])==today]
It's far from optimal, of course, but that's the price you pay. You
should, of course, be able to easily define optimization rules that
handle this scenario for a given mapping, as long as an SQL
translation is possible.
For example, if you know you will be using some_function() on an
event's date field a lot, you could define your db schema such that
you store the value in a separate (and possibly-indexed)
field. Then, you could write a rule that matches the pattern
'some_function(x.date)' where 'x' is a variable of type 'Event', and
rewrite the sub-expression accordingly.
Pattern Matching Rewrites
-------------------------
PEAK-Rules already does rewrites like this internally. For example,
here's a rule from peak.rules.predicates that matches expressions of
the form 'type(x) is y', in order to turn them into type/class matches:
when(...,
"expr in Call and expr.func in Const and (expr.func.value is type)"
" and len(expr.args)==1"
)
Breaking this down, it matches when "the expression is a Call() node
and the expression being called is a constant, and the value of that
constant is the 'type' builtin, and the call has a single positional
argument". Ugly, but it does the job.
I've been thinking, however, that it would be very useful to be able
to instead specify the above expression this way:
"expr in `type(`x`)`"
The idea here is that backquotes around a name would denote a "bind
variable", and backquotes around anything else would constitute a
"match pattern". If the match pattern is successfully matched, then
the "bind variables" in it are set to equal the corresponding parts
of the matched expression.
So, in "expr in `type(`x`)` and x in Const" would expand to adding
"and expr.args[0] in Const" to the earlier example, giving us this:
"expr in Call and expr.func in Const and (expr.func.value is type)"
" and len(expr.args)==1 and expr.args[0] in Const"
But of course, "expr in `type(`x`)` and x in Const" is a lot easier
to type. :) Such a shorthand will probably be essential to the
implementation of a query rewrite system, just because manually
translating these things is tedious and error-prone.
In fact, just writing this has shown me that the rewrite rules in
PEAK-Rules are actually incorrect, in that the rule I quoted above
will also match 'type(x, y=42)', and the body of the rule in question
will then silently throw away the y=42 part: clearly an error.
But an automatic translation of a pattern match wouldn't make that
mistake -- once a correct rule is defined for matching Call()
patterns, for example, then all subsequent Call() translations will
be done correctly.
So, this suggests that adding a pattern match facility to PEAK-Rules
should be done before serious work on the query translator begins,
and so I'll need to write a separate post about the details of that
implementation. I've already planned to add such a feature to
PEAK-Rules, so it's not really a new idea. It just needs to get
implemented, basically.
Rewriting Queries
-----------------
Okay, so how do we rewrite a query into an SQL expression plus
Python? In the simplest case, the rewrite of a query is itself; in
other words, no rewrite at all. In the extreme case, the rewrite is
a complete replacement of the expression with code that does a
database query. In between, there is some mixture of the two.
A general rewrite strategy would begin by replacing *collections* in
the query, moving from left to right through all the 'in' and 'if'
clauses, and binding the expressions in the 'for' clauses so that
expressions further to the right can be interpreted
correctly. (Collections in the output expression would be replaced
last, since even though the output is the leftmost expression, it's
evaluated as if it were the rightmost.)
The simplest collections to spot and replace are constants referring
to classes (or potentially, to predefined query objects). These can
simply be replaced with subexpressions representing the appropriate
tables or queries in the database.
Collections, however, that are attributes of other objects, are more
complex. These attributes can only be spotted when the expression's
underlying database type is known. In other words, we only know that
'e.date' refers to the event.date field in the database, if we know
that 'e' is an event. And we only know that 'e' is an event, if 'e'
was defined as a 'for' variable against an 'in' whose type is also
known to us. Otherwise, we will have to leave that particular
subexpression to be executed in Python.
In any case, that's why we'll need to work from left to right, so
that as 'for' variables are encountered, we can keep track of their
corresponding type. This also implies that we will need a generic
function that takes an arbitrary expression tree, and returns its
database type. By default, it will return "unknown", but it would be
able to handle local variables by looking up the type assigned, and
it could handle Getattr() nodes by recursively identifying the type
of the target, then looking up the mapping for the named attribute.
We can then collapse any *consecutive* "for/in" clauses where we have
a known database type for the variable into a *single* for/in clause
that loops over a database query, yielding tuples containing each of
the subexpressions needed by subsequent 'if' or 'in' clauses, or
needed by the output expression.
Subexpression Lifting
---------------------
Within each 'if' or 'in' clause, we must replace Python expressions
with database expressions wherever possible. Any subexpression whose
component parts are sourced from the database, and whose operator can
be performed by the database, should be replaced with a database
expression. In the case of 'if' clauses, any top-level conditional
expressions that can be expressed in the database should be removed
entirely, and replaced by adding conditions to the corresponding
'in'-clause query. (TODO: what about OR-ed conditions in an 'if' clause?)
In the case of db data in 'in' clauses and the output expression, or
in the case of 'if' conditions that can't be expressed in the db, but
nonetheless are based on db values, the database-backed expression
must be replaced with a reference to a field of the rows returned by
the database query, and the query must be defined so as to include
that field in its output.
These output (SELECT) fields should be added to the query in order of
appearance in the output expression, so that in the (hopefully
common) case where the query can be expressed entirely in SQL, the
entire query can be replaced by a simple call to an SQL query,
without any loop code in Python.
So, let's say we have a generic function, lift(), that takes an
expression and replaces as many component parts as possible with
db-backed expressions, returning a replacement Python expression, and
a mapping of the db expressions used by that Python expression.
Thus "lift(a.b)" (actually "lift(Getattr(Local('a'), 'b'))") would
return one of the following:
Getattr(Local('a'), 'b')), {}
Getattr(Local('$1'), 'b'), {'$1':field(sometype, a_attr)}
Local('$1'), {'$1': field(sometype, b_attr)}
depending on whether 'a' and '.b' were known types/attributes. In
the first case, 'a' is a variable of unknown type, so the expression
is left as-is, and no new information needs to be added to the query.
In the second case, the type of 'a' is known, but the 'b' attribute
is not db-backed, and thus the '.b' operation must occur in
Python. However, since 'a' is known, we can insert a marker object
to stand in its place, and add the marker's definition to our output mapping.
In the third case, the 'b' attribute is in the db, so we can just
query it directly and assign it to '$1' (a generated local variable name).
Once we've lifted all the db-backed subexpressions from every clause
of the query, we can use the map of generated local variables ('$1',
'$2', etc.) to write the SELECT clause(s) of our queries, and to
write the effective 'for' clause of our replacement. For example:
$1 for $1, $2 in query(...) if $2.foo==2
Whew. Complex, but so far not too complicated. The rewrites all
take place at the AST layer, so the result can then just get turned
into a function using BytecodeAssembler.
Query Parameters and Pre-compiling
----------------------------------
For performance reasons, we're not going to want to do all this stuff
every time a query is executed, which means that we're going to want
to compile most queries once and then forget about them. In fact, we
want to NOT compile most queries, but compile them on first execution
and then save the compilation -- in much the same way that PEAK-Rules
does for generic functions.
That means that ideally, we'll want to define queries in such a way
that they can accept *parameters*... which means functions, and
therefore decorators. I can see using something like:
@db.defquery("[a.b for a in A if a.c==d]")
def b_for_a_by_c(d):
"""Return a.b where a.c==d"""
This would make b_for_a_by_c into a special function that compiles
its query on first call, then replaces its own code with the compiled
query, so that subsequent calls go right on through. It should also
be possible to use this to define attributes and methods within a
class, such that 'self' is treated as a parameter of known type in
the query expression.
Further, if these query functions or properties exposed metadata
about their type and definition, then it would be possible to use
them in other query expressions... so long as they were not
overridden in subclasses, that is.
It would be nice if there were a way to specify the types of any
non-'self' arguments to a query function; perhaps keyword arguments
to 'defquery' (or whatever it ends up being called) would do the
trick. This might not be necessary, since the type can often be
inferred (or ignored) based on usage. But if the parameter is used
in an 'in' clause, or is a collection of some kind, knowing more
would be helpful.
Open Issues
-----------
* OR-ed subexpressions in 'if' clauses
* collection-wide operations, e.g. x.y == z.y where .y is a
collection attribute
* aggregate expressions, like sum() in an output expression
* Mapping from field values to objects, and vice versa
* Subclasses overriding property/method query defs
* Parameterized queries with optional conditions -- e.g. "[x for x in
Foo if x.bar==param or param is None]" -- we probably would like to
generate code that lifts the 'param is None' test outside the loop,
thus producing [x for x in Foo] if param is None else [x for x in
Foo if x.bar==param]. This probably needs to be looked at in
connection with the general case of OR-ed subexpressions, but it
could also apply to any loop-invariant parts of any subexpression -
i.e., loop-invariant expressions should be lifted out of the loop.
Anyway, I'll try to nail these loose ends down a bit more in future posts.
More information about the PEAK
mailing list