[TransWarp] Towards a query theory, part 6: building queries from filters
Phillip J. Eby
pje at telecommunity.com
Thu Oct 16 22:29:53 EDT 2003
It's been a long day -- alternating between studying scripture (i.e.
academic writings on relational DB's, joins, queries, and optimization) and
testing my faith (by conducting experiments). But I think I have finally
seen the light (figured out the darn algorithm).
Here goes. First, filter operators:
Logical: AND, OR, NOT
Variables: VAR
Comparators: EQ, GT, LT, NE, etc.
Traversal: MAYBE (outer join), FEATURE, FUNCTION
These are the only operators. I'm ignoring aggregation and group-by for
this first pass.
A "query builder" is an object that constructs a relational expression,
according to instructions given it by a filter. A query builder
essentially holds one large theta-join construct, augmented with a list of
peer outer joins (to other query builders). After exhaustive (and
exhausting) study, I've concluded that this is adequate to model all join
constructs that are possible given the filter operators above. (MAYBE
represents an outer join to its entire subtree, so any nested MAYBE's are
performed first, meaning that they can be contained in the parent query's
MAYBE list.)
Rather than using EXISTS(SELECT ...) for subqueries, we will use IN(SELECT
...). This appears to be enormously easier to do correctly.
Query builders are initialized with a starting object type. The query
builder has to know for a given object type, what fact types
(relationships, functions, etc.) exist, and what relational construct they
map to, in the form of an RV. Query builders also keep track of DV's, and
whether they're shown, labelled with a correlation name, etc.
Initially, the builder has neither any RV's or DV's. To begin building a
query, you ask the builder for a DV of the desired starting object
type. You then supply the builder and the DV to the buildQuery method of
your initial filter node. The filter returns a logical expression, that
you then pass to the select() method of the query builder, in order to
receive your relational query expression. (Which you then may build SQL or
something else from.)
Each node type has different behavior for buildQuery(). AND() just returns
the AND() of the logical expressions of its subnodes. OR(), NOT(), and
MAYBE() ask the builder for a new DV that is flagged as requiring any joins
to be expressed as a subquery of appropriate type. (More on this
later.) OR() and NOT() then return the OR or NOT of an
'IN(oldDV,subqueryForChild)' condition for each child's subquery, if the
child did any joins. MAYBE() returns no logical expression, as it just
finishes off its child's subquery and adds it to the current query
builder's outer join list.
FEATURE() and FUNCTION() call the query builder's 'traverse(oldDV,path)'
method to receive a new DV. If the 'oldDV' is bound to an RV column, and
the desired new DV is available in the same (virtual) table, no new RV is
created. Instead, a DV bound to one of the existing RV's columns is
returned. However, if the current DV is bound to a different table/column
than the desired traversal requires, the appropriate base table (or view,
whatever) is remapped to create a table alias, join criteria are added to
the query, and one of the new RV's columns is returned as the new DV.
Anyway, once a feature or function gets a new DV, it just passes it along
to its subexpression, and it returns whatever logical expression the
subexpression returns.
So back to subquery DV's. A subquery DV has some extra state that says to
the query builder, "if you use me in a join, create the RV in this new
subquery, and return the new DV from there instead of the parent
query." The tricky bit is that I'm not sure how to communicate to the new
DV's user that they need to return an appropriate IN() expression to join
it up.
Bleah. This post is messy and disorganized, and the algorithm isn't much
better... BUT, the problems in this algorithm are of a much different
nature than what we've had before. They are now practical considerations
of how to divide up responsibilities between some objects, as opposed to
wide-open theory questions. I now have some whiteboard drawings that
convey the structural mechanics of what I want this thing to do. And, I
could probably describe this algorithm more cleanly if I just wrote up
transformation rules rather than trying to describe all the state tracking
mechanisms needed to implement the rules. So here goes...
* OR() and NOT() subtrees may not contain output variables; therefore any
joins occurring under them must be converted to IN() subqueries, as is true
for any one-to-many join where the subtree contains no output variables,
but the current query does. (The latter restriction prevents the subquery
from breaking itself into further subqueries unless another OR() or NOT()
is encountered.)
* MAYBE() converts its entire subtree into an RV that is outer-joined with
the query that the MAYBE is immediately contained in. (Unless the MAYBE
subtree has no joins, in which case the contained criteria simply bubble up
to the containing query.)
* All other joins simply add a new RV and join criteria to the current
subtree, which is in relational terms a theta join.
* A completed query builder can convert itself to an RV by applying
remap(), select(), and outer_join() as appropriate to the theta_join() of
its contained RV's and join criteria.
* Comparisons get their implicit operand from the current DV. An equality
comparison to a variable, "labels" the current DV with that variable name.
* Correlation variable unification is done by each query builder before
conversion to an RV. Basically, this consists of substituting each
occurrence of the variable with a reference to some other occurrence, such
that there is a path from any occurence to any other occurrence. The last
such chained occurence for each variable is passed down to each subquery,
so that the subqueries can do the same unification process.
Whew. That wasn't so bad. Doing a quick review of a few ConQuer->SQL
examples show that there's still a devil of a lot of details to get right,
such as defaulting an RV in the theta_join if the start-DV is never given
an RV, and tracking output variables from MAYBE outer joins. However, the
relational join structures created by these rules seem to closely mirror
the SQL generated by ConQuer, at least for the examples I reviewed. But, I
haven't reviewed any queries involving aggregates, mainly because I haven't
established a filter syntax for them, let alone a conversion algorithm.
Anyway, now we know (in principle) how to get from filters with optional
correlations, to relational expressions, with the relational expressions
being quite likely to having reasonable SQL translations, for any SQL
dialect that permits correlated subqueries of the form 'IN (SELECT ...)',
and ANSI SQL-92 outer joins. AFAIK, these features are supported by at
least Oracle, Sybase (12+) and PostgreSQL, but not by Gadfly or
SQLite. (Actually, if I understand correctly, SQLite can manage as long as
all the outer joins are LEFT, and there are no correlations in the IN ()
subqueries.)
So, we're really cooking now. All that's left to work out in the "theory"
is aggregate handling and syntax. At this point, I'm confident that the
rest is basically going to be a set of programming problems and relatively
minor design questions, rather than computer science problems!
I might note that one possible simplification of the query-builder
mechanism for SQL generation or other purposes, might be to implement query
builders that are actually back-end specific, as opposed to merely schema
specific. (A query builder needs a schema to understand types, joins,
traversals, etc.) However, such query builders wouldn't be able to take
advantage of the "view" mechanisms described earlier in this series of
posts, and would have to hand-map from traversals to possibly multiple
joins. On the other hand, such a back-end could maybe do much more
fine-grained optimization, of a nature that might be needed for the project
that's driving me to do this research in the first place...
For example, we have an app that frequently wants to find objects that are
related to some object that's a child in a hierarchy. Some of these
hierarchies are small enough to cache in-memory, and thus to use a constant
comparison like 'IN (1,27,543,291,...)' in the SQL, rather than using an
'IN (SELECT ...)' subquery. However, some databases have severe upper
limits on the number of constants allowable in an 'IN()' clause. If the
SQL generator could check the size of the list involved (it's usually easy,
because there's a specific hierarchy parent being matched), then it could
decide whether to use a subselect or a constant list. I don't know yet how
important this will be, though.
Anyway, it's late once again, and I've now been working on this for close
to 12 hours, so I'll close this one for today.
More information about the PEAK
mailing list