[TransWarp] Towards a query theory, part 6: building queries from filters

Thu Oct 16 22:29:53 EDT 2003

It's been a long day -- alternating between studying scripture (i.e. 
academic writings on relational DB's, joins, queries, and optimization) and 
testing my faith (by conducting experiments).  But I think I have finally 
seen the light (figured out the darn algorithm).

Here goes.  First, filter operators:

Logical: AND, OR, NOT
Variables: VAR
Comparators: EQ, GT, LT, NE, etc.
Traversal: MAYBE (outer join), FEATURE, FUNCTION

These are the only operators.  I'm ignoring aggregation and group-by for 
this first pass.

A "query builder" is an object that constructs a relational expression, 
according to instructions given it by a filter.  A query builder 
essentially holds one large theta-join construct, augmented with a list of 
peer outer joins (to other query builders).  After exhaustive (and 
exhausting) study, I've concluded that this is adequate to model all join 
constructs that are possible given the filter operators above.  (MAYBE 
represents an outer join to its entire subtree, so any nested MAYBE's are 
performed first, meaning that they can be contained in the parent query's 
MAYBE list.)

Rather than using EXISTS(SELECT ...) for subqueries, we will use IN(SELECT 
...).  This appears to be enormously easier to do correctly.

Query builders are initialized with a starting object type.  The query 
builder has to know for a given object type, what fact types 
(relationships, functions, etc.) exist, and what relational construct they 
map to, in the form of an RV.  Query builders also keep track of DV's, and 
whether they're shown, labelled with a correlation name, etc.

Initially, the builder has neither any RV's or DV's.  To begin building a 
query, you ask the builder for a DV of the desired starting object 
type.  You then supply the builder and the DV to the buildQuery method of 
your initial filter node.  The filter returns a logical expression, that 
you then pass to the select() method of the query builder, in order to 
receive your relational query expression.  (Which you then may build SQL or 
something else from.)

Each node type has different behavior for buildQuery().  AND() just returns 
the AND() of the logical expressions of its subnodes.  OR(), NOT(), and 
MAYBE() ask the builder for a new DV that is flagged as requiring any joins 
to be expressed as a subquery of appropriate type.  (More on this 
later.)  OR() and NOT() then return the OR or NOT of an 
'IN(oldDV,subqueryForChild)' condition for each child's subquery, if the 
child did any joins.  MAYBE() returns no logical expression, as it just 
finishes off its child's subquery and adds it to the current query 
builder's outer join list.

FEATURE() and FUNCTION() call the query builder's 'traverse(oldDV,path)' 
method to receive a new DV.  If the 'oldDV' is bound to an RV column, and 
the desired new DV is available in the same (virtual) table, no new RV is 
created.  Instead, a DV bound to one of the existing RV's columns is 
returned.  However, if the current DV is bound to a different table/column 
than the desired traversal requires, the appropriate base table (or view, 
whatever) is remapped to create a table alias, join criteria are added to 
the query, and one of the new RV's columns is returned as the new DV.

Anyway, once a feature or function gets a new DV, it just passes it along 
to its subexpression, and it returns whatever logical expression the 
subexpression returns.

So back to subquery DV's.  A subquery DV has some extra state that says to 
the query builder, "if you use me in a join, create the RV in this new 
subquery, and return the new DV from there instead of the parent 
query."  The tricky bit is that I'm not sure how to communicate to the new 
DV's user that they need to return an appropriate IN() expression to join 
it up.

Bleah.  This post is messy and disorganized, and the algorithm isn't much 
better... BUT, the problems in this algorithm are of a much different 
nature than what we've had before.  They are now practical considerations 
of how to divide up responsibilities between some objects, as opposed to 
wide-open theory questions.  I now have some whiteboard drawings that 
convey the structural mechanics of what I want this thing to do.  And, I 
could probably describe this algorithm more cleanly if I just wrote up 
transformation rules rather than trying to describe all the state tracking 
mechanisms needed to implement the rules.  So here goes...

* OR() and NOT() subtrees may not contain output variables; therefore any 
joins occurring under them must be converted to IN() subqueries, as is true 
for any one-to-many join where the subtree contains no output variables, 
but the current query does.  (The latter restriction prevents the subquery 
from breaking itself into further subqueries unless another OR() or NOT() 
is encountered.)

* MAYBE() converts its entire subtree into an RV that is outer-joined with 
the query that the MAYBE is immediately contained in.  (Unless the MAYBE 
subtree has no joins, in which case the contained criteria simply bubble up 
to the containing query.)

* All other joins simply add a new RV and join criteria to the current 
subtree, which is in relational terms a theta join.

* A completed query builder can convert itself to an RV by applying 
remap(), select(), and outer_join() as appropriate to the theta_join() of 
its contained RV's and join criteria.

* Comparisons get their implicit operand from the current DV.  An equality 
comparison to a variable, "labels" the current DV with that variable name.

* Correlation variable unification is done by each query builder before 
conversion to an RV.  Basically, this consists of substituting each 
occurrence of the variable with a reference to some other occurrence, such 
that there is a path from any occurence to any other occurrence.  The last 
such chained occurence for each variable is passed down to each subquery, 
so that the subqueries can do the same unification process.

Whew.  That wasn't so bad.  Doing a quick review of a few ConQuer->SQL 
examples show that there's still a devil of a lot of details to get right, 
such as defaulting an RV in the theta_join if the start-DV is never given 
an RV, and tracking output variables from MAYBE outer joins.  However, the 
relational join structures created by these rules seem to closely mirror 
the SQL generated by ConQuer, at least for the examples I reviewed.  But, I 
haven't reviewed any queries involving aggregates, mainly because I haven't 
established a filter syntax for them, let alone a conversion algorithm.

Anyway, now we know (in principle) how to get from filters with optional 
correlations, to relational expressions, with the relational expressions 
being quite likely to having reasonable SQL translations, for any SQL 
dialect that permits correlated subqueries of the form 'IN (SELECT ...)', 
and ANSI SQL-92 outer joins.  AFAIK, these features are supported by at 
least Oracle, Sybase (12+) and PostgreSQL, but not by Gadfly or 
SQLite.  (Actually, if I understand correctly, SQLite can manage as long as 
all the outer joins are LEFT, and there are no correlations in the IN () 
subqueries.)

So, we're really cooking now.  All that's left to work out in the "theory" 
is aggregate handling and syntax.  At this point, I'm confident that the 
rest is basically going to be a set of programming problems and relatively 
minor design questions, rather than computer science problems!

I might note that one possible simplification of the query-builder 
mechanism for SQL generation or other purposes, might be to implement query 
builders that are actually back-end specific, as opposed to merely schema 
specific.  (A query builder needs a schema to understand types, joins, 
traversals, etc.)  However, such query builders wouldn't be able to take 
advantage of the "view" mechanisms described earlier in this series of 
posts, and would have to hand-map from traversals to possibly multiple 
joins.  On the other hand, such a back-end could maybe do much more 
fine-grained optimization, of a nature that might be needed for the project 
that's driving me to do this research in the first place...

For example, we have an app that frequently wants to find objects that are 
related to some object that's a child in a hierarchy.  Some of these 
hierarchies are small enough to cache in-memory, and thus to use a constant 
comparison like 'IN (1,27,543,291,...)' in the SQL, rather than using an 
'IN (SELECT ...)' subquery.  However, some databases have severe upper 
limits on the number of constants allowable in an 'IN()' clause.  If the 
SQL generator could check the size of the list involved (it's usually easy, 
because there's a specific hierarchy parent being matched), then it could 
decide whether to use a subselect or a constant list.  I don't know yet how 
important this will be, though.

Anyway, it's late once again, and I've now been working on this for close 
to 12 hours, so I'll close this one for today.