[TransWarp] Towards a query theory, part 1: filters and
Phillip J. Eby
pje at telecommunity.com
Tue Oct 14 15:26:58 EDT 2003
At 11:43 AM 10/14/03 -0500, Ian Bicking wrote:
>On Tuesday, October 14, 2003, at 11:12 AM, Phillip J. Eby wrote:
>>Generating SQL from "easy" queries involves a lot of details, but is
>>straightforward, on the whole. Mostly it's a process of walking the
>>query to extract table aliases, joins, and criteria.
>>Here are PEAK's "hard" query requirements:
>>* Support correlations (e.g. the "find employees that live in the same
>>city as their supervisor, but were born in a different country" example)
>If SQLObject were extended to do this, it might look like:
>var = Employee.var
>query = AND(var.address.city == var.supervisor.address.city,
> var.birthplace.city != var.supervisor.birthplace.city)
Here's an interesting question for you. How would you know in this case
that the two 'var.supervisor' references are to the same object? Yes, I
know it's a one-to-one relationship here, but if it were a one-to-many, the
semantics get interesting. :)
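To make the ambiguity concrete, here's a minimal sketch (not SQLObject's real API; the `Var` class and its attribute-path recording are invented for illustration) of how binding the supervisor to an explicitly named variable would let a query compiler know that both criteria range over the *same* supervisor row, even for a one-to-many relationship:

```python
# Hypothetical sketch: explicit query variables that record join paths,
# so two uses of 'boss' are provably references to one row.

class Var:
    """A named query variable; attribute access records a traversal path."""
    def __init__(self, name, path=()):
        self.name, self.path = name, path
    def __getattr__(self, attr):
        return Var(self.name, self.path + (attr,))
    def __repr__(self):
        return ".".join((self.name,) + self.path)

emp = Var("emp")
boss = Var("boss")   # bound once; both uses below denote the same row

# "lives in the same city as their supervisor, but was born in a
# different country" -- 'boss' appears twice, so the correlation is
# unambiguous even if 'supervisor' were a one-to-many relationship.
criteria = [
    ("join", emp.supervisor, boss),
    ("eq",   emp.address.city, boss.address.city),
    ("ne",   emp.birthplace.country, boss.birthplace.country),
]
for c in criteria:
    print(c)
```

With implicit `var.supervisor.…` traversals, by contrast, each occurrence could in principle bind to a different related row.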
>SQLObject's basic metaphor is one of tightly encapsulated row/objects, so
>we have to produce objects, not arbitrary tuples/relations. I.e., we have
>to find instances of Employee, not just cities, or a combination of
>employees and their supervisors.
Understood. For purposes of conceptual simplification, I'm considering
selecting an object to mean selecting the *primary key* of the object, not
any of its attributes. This allows me to translate to "standard"
>>* Support aggregation (count/min/max/avg) over groups (GROUP BY in SQL)
>I would not put this in the query. It doesn't affect which objects we
>find, but rather what results we return. Maybe if we wanted the total
>salary paid, listed according to supervisor:
>var = Employee.var
>select(var.supervisor, groupBy=var.supervisor, aggregate=SUM(var.salary))
>Implementing that, though, would be challenging. Maybe easier if select()
>was given more explicit information about what it was expected to do.
Indeed. One of my goals here is to get a SQL-independent query language
with *at least* the expressive power of SQL. But, due to a limited time
budget, I will either crack this problem this week, or else decide on a
known-to-be-crackable subset of the problem. Most likely, that would be
the "easy" subset.
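For what it's worth, the aggregation case by itself maps to SQL mechanically. Here's a hedged sketch (the `render_aggregate` function and its spec shape are my invention, not PEAK's or SQLObject's actual structures) of rendering a group-by/aggregate spec as SQL text:

```python
# Illustrative only: turning a (table, group column, aggregate) spec
# into a GROUP BY query string.

def render_aggregate(table, group_by, agg_func, agg_col):
    """Build a GROUP BY query string from a simple aggregate spec."""
    return (
        f"SELECT {group_by}, {agg_func}({agg_col}) "
        f"FROM {table} GROUP BY {group_by}"
    )

print(render_aggregate("employee", "supervisor_id", "SUM", "salary"))
# -> SELECT supervisor_id, SUM(salary) FROM employee GROUP BY supervisor_id
```

The hard part isn't the rendering; it's deciding, in a database-independent query structure, which expressions belong to the grouping and which to the result.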
>>* Support cross-database joins, where an object's data resides in more
>>than one system, or where some of an object's features are the result of
>>computation in Python over data retrieved from an external source.
>Well, this is just a nice feature that's hard. If you support Python
>execution you can always do it brute-force. If you want to do it right
>you need a query optimizer. Holding out for a single theory of queries
>that supports this on top of everything else seems like a hurdle that is
>too high to leap (at least in one bound).
Well, the published work on Gadfly 2 indicates that if you're willing to
live with only equijoins and AND() criteria (only allowing OR(), NOT(), and
inequalities to be applied to a single table at a time), it's actually
rather simple to support. The greedy-optimization "find next thing to
join" algorithm in Aaron's IPC 8 paper looks like just the thing. He just
didn't explain how to do cross-table OR/NOT/inequalities, UNION, outer
joins, aggregation, etc.
I see in principle how to extend this, using a concept of "input tables"
and the "output table" for a query. All the input tables from a single
database can be grouped into a subquery whose output table is then used as
an input table in the top-level query. The subquery is then converted into
the query language of the target database.
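The grouping step above can be sketched in a few lines; this is a toy illustration (table and database names are made up), assuming each input table knows which database it lives in:

```python
# Partition a query's input tables by database, so each group can become
# one subquery pushed down to that database, whose output table is then
# an input table of the top-level query.

from collections import defaultdict

def group_by_database(input_tables):
    """Map each database to the list of its input tables."""
    groups = defaultdict(list)
    for table, db in input_tables:
        groups[db].append(table)
    return dict(groups)

tables = [("employee", "hrdb"), ("address", "hrdb"), ("payroll", "findb")]
print(group_by_database(tables))
# -> {'hrdb': ['employee', 'address'], 'findb': ['payroll']}
```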
It seems to me that the other "missing features" of Gadfly 2 are similarly
implementable by breaking a query into subqueries that implement the
feature by combining input tables into a "virtual" input table for the
containing query. But, it may be that there are easier ways to do aspects
of this.
For example, if an OR() contains criteria against more than one table, the
OR() should be used as a filter against the query's output table. If
different branches of the OR() include joins of their own, they have to be
marked as outer joins. Single-table inequality comparisons that are not
children of an OR() expression can be used as filters directly against the
input table. Similar restrictions apply to NOT() nodes - if they contain
subexpressions dealing with more than one table, they have to be applied to
the output rather than the input.
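The placement rule above reduces to a small decision function. Here's a rough sketch (the function name, the table-list representation, and the `under_or` flag are all assumptions for illustration):

```python
# Decide where a criterion is applied: against an input table directly,
# or against the query's output table.  A criterion touching more than
# one table, or a single-table test nested under a cross-table OR(),
# must wait until the output table exists.

def placement(criterion_tables, under_or=False):
    """Return 'input' or 'output' for a criterion's filter placement."""
    if len(set(criterion_tables)) > 1:
        return "output"   # cross-table OR/NOT/inequality filters the result
    if under_or:
        return "output"   # single-table test inside a cross-table OR()
    return "input"        # plain single-table criteria push down

print(placement(["employee"]))                 # -> input
print(placement(["employee", "address"]))      # -> output
print(placement(["employee"], under_or=True))  # -> output
```

(This sketch ignores the outer-join marking that branches of a cross-table OR() would also require.)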
UNION, EXISTS, outer joins, aggregation, and the like seem to me to require
"virtual" tables to supply alternate join semantics, however. For example,
a UNION table would "join" two input tables by merging them. An
aggregation would "flatten" an input table to a single row, or to rows per
distinct combination of key columns.
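To illustrate those two alternate join semantics, here is a toy sketch using plain dicts as rows (the function names and row representation are mine, purely for illustration):

```python
# A UNION table "joins" two input tables by merging their rows (SQL
# UNION keeps distinct rows); an aggregation table "flattens" its input
# to one row per distinct key value.

def union_table(rows_a, rows_b):
    """Merge two input tables' rows, dropping duplicates as UNION does."""
    seen, out = set(), []
    for row in rows_a + rows_b:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def aggregate_table(rows, key, agg=len):
    """Flatten input rows to one output row per distinct key value."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return [{key: k, "agg": agg(v)} for k, v in groups.items()]

a = [{"city": "Oslo"}, {"city": "Lima"}]
b = [{"city": "Lima"}, {"city": "Kiev"}]
print(union_table(a, b))              # three distinct rows
print(aggregate_table(a + b, "city"))
```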
For single-database SQL generation, it's sufficient to have data structures
that encode these concepts, since they can be mapped back to SQL. For
multi-database query generation, I think each kind of virtual table would
need some optimization rules for how it would join itself to other tables
at the equijoin level. And, virtual tables that encapsulate tables from
more than one database are also likely to be "interesting".
Of course, we could consider our top-level output query to be just another
"virtual table", one that happens to be based on equijoining its input
tables. So, its input tables must support whatever interface is needed to
do an equijoin with that table, just as each other table type's input
tables must support whatever interface is needed for that kind of join.
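A rough sketch of this uniform-interface idea, assuming a deliberately minimal protocol (the class and method names here are invented, not PEAK's):

```python
# Every table, base or virtual, exposes the same minimal protocol its
# enclosing equijoin needs -- so the top-level query is itself just a
# virtual table that equijoins its inputs.

class InputTable:
    """Minimal protocol an equijoining parent needs from its inputs."""
    def rows(self):
        raise NotImplementedError
    def index_on(self, column):
        """Map column value -> matching rows, for join probing."""
        idx = {}
        for row in self.rows():
            idx.setdefault(row[column], []).append(row)
        return idx

class ListTable(InputTable):
    """A base table backed by an in-memory list of dict rows."""
    def __init__(self, data):
        self.data = data
    def rows(self):
        return self.data

class EquijoinTable(InputTable):
    """An 'output table' that equijoins two input tables on a column."""
    def __init__(self, left, right, column):
        self.left, self.right, self.column = left, right, column
    def rows(self):
        idx = self.right.index_on(self.column)
        out = []
        for lrow in self.left.rows():
            for rrow in idx.get(lrow[self.column], []):
                out.append({**lrow, **rrow})
        return out

emp = ListTable([{"dept": 1, "name": "Ann"}, {"dept": 2, "name": "Bo"}])
dept = ListTable([{"dept": 1, "city": "Oslo"}])
print(EquijoinTable(emp, dept, "dept").rows())
# -> [{'dept': 1, 'name': 'Ann', 'city': 'Oslo'}]
```

Because `EquijoinTable` is itself an `InputTable`, it can serve as an input to yet another table operator, which is the composition property the paragraph above is after.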
Hm. It seems to me that implementing this concept would require, at minimum:
1. An algorithm to map logical/traversal/etc. operators into "table
operators" (equijoins, outer joins, EXISTS, aggregates, remap, etc.) and
filters (logical expressions composed of comparisons).
2. An understanding of each table operator sufficient to allow its
implementation.
3. Defining interfaces that table operators should provide for use by an
enclosing table operator.
In a later e-mail, I'll begin an analysis of items 1 and 2.