[TransWarp] Towards a query theory, part 4: RV operator optimization
Phillip J. Eby
pje at telecommunity.com
Wed Oct 15 16:09:01 EDT 2003
I found an interesting Java library today called XXL. It's a query library
that includes, among other things, a relational algebra framework and
both logical and physical query optimizers! It creates relational
operator trees similar to what I described in part 3.
So far, I've mainly been studying its logical optimizer, which does things
rather similar to what I'd envisioned. For example, it tries to "push
down" criteria close to their respective tables, as well as projections and
renames.
In some respects, though, it seems far more complicated than necessary, and
I'm not sure yet why. Some of the complexity comes from them trying to do
higher-order functional programming in Java (which is much easier in
Python). But then, much of the rest of the complexity seems to stem from
the optimizer not being written in a functional style! All of the actual
optimizer rules are described in functional terms, but the implementations
modify relational operator objects *in place*. I suspect this may be
intended to increase performance by reducing the amount of Java garbage
collection required, but it's hard to say for sure.
Another weird thing about their optimization rules is that they reiterate
some aspects in multiple passes, but there's no reason that I can see why
that part of the algorithm shouldn't execute either once, or a potentially
infinite number of times. For them to do it two or three times -
hardcoded! - makes no sense to me.
Of some 21 optimization rules, 7 look safe and simple to implement as
part of the relational operator constructors. They are basically
optimizations that stack redundant operators into single operators, like
converting select(select(r1,f1),f2) into select(r1,f1&f2). Their current
optimizer repeatedly reruns these 7 rules after certain other operations in
order to clean up after itself, so building these into the constructors
should make it unnecessary to manually reiterate them.
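As a rough sketch of what "building these into the constructors" could look like in Python, a __new__ method can collapse stacked selections at construction time. (All names here are illustrative, not taken from XXL or PEAK, and the criteria are just placeholder tuples rather than real logical expressions.)

```python
# Hypothetical sketch: flattening select(select(r1, f1), f2) into
# select(r1, f1 & f2) at construction time, via __new__.

class RV:
    """Base class for relational-operator nodes."""

class Table(RV):
    def __init__(self, name):
        self.name = name

class Select(RV):
    def __new__(cls, rv, criteria):
        # If the input is already a Select, merge the criteria
        # instead of stacking a second node on top of it.
        if isinstance(rv, Select):
            return Select(rv.rv, ('AND', rv.criteria, criteria))
        self = object.__new__(cls)
        self.rv = rv
        self.criteria = criteria
        return self

t = Table('orders')
q = Select(Select(t, 'f1'), 'f2')
assert isinstance(q.rv, Table)           # only one Select node remains
assert q.criteria == ('AND', 'f1', 'f2')
```

Because the flattening happens in __new__, there is no separate optimizer pass to rerun; a redundant operator simply never comes into existence.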
The overall optimizer strategy appears to be:
1. Separate criteria out of theta joins, converting them to
select(join(),criteria)
2. Combine all natural_join(R, natural_join(S,T)) nodes to natural_join(R,S,T)
3. Push all selection criteria down as close to the "real" tables as possible
4. Pull all projections as high in the tree as possible, combining
redundant ones along the way
5. Push all rename operations as low in the tree as possible
6. Recombine any redundant operations (e.g. join(join(...)),
select(select(...)), etc.)
7. Push criteria back into theta joins, wherever there's a
select(theta_join(...),whatever), and recombine any adjacent theta_joins
8. Pull all criteria (that didn't just get blended into theta joins) as
high in the tree as possible. (Why????)
9. Repeat step 3, pushing all the criteria as low in the tree as
possible. (Wha???? We just pulled them back *up*!)
10. Repeat step 6, recombining redundant operations.
11. Repeat step 7, pushing criteria back into theta joins. (Why would we
now have any that couldn't have been pushed in there the last time?)
12. Push projections as low in the tree as possible, but don't push them
lower than selections. (That is, eliminate all columns we don't need, but
don't eliminate ones that are part of criteria.)
My guess at this point is that the weird steps 8-11 have something to do
with the way the optimizer modifies and/or replaces nodes in-place. Here's
what I think the overall intent is:
1. Selection is the first thing that happens to a table
2. Projection and renaming happens next
3. Joins and aggregations happen in whatever order is specified
4. Redundant adjacent operators are flattened
Restated this way, it seems to me that it should be possible to embed the
optimization entirely in the relational operators' __new__ methods. Thus,
an operation like select() would check to see if its input was a select(),
and if so, create a combined select. If not, it would check whether its
input was a join and, if possible, recreate the join so that the criteria
are pushed down to the individual tables. If a rename, it would recreate
the rename so that the renamed operator had the selection applied to
it. And so on.
Of course, rather than write lots of if-then tests, we could simply create
protocols like 'ISelectConstructor' to adapt the input operator to. Then,
we could implement that protocol differently for different operator types.
Indeed, it seems that we could likely make select(filter) and
project/rename (remap(), probably) simply be methods of RV objects. This
would let them "do the right thing" simply by delegating the method to
their child RV's where appropriate.
This approach is algorithmically very clean, is easily extended to new
types of RV operators, and has the advantage of being "always
optimized". That is, any given query is always in its logically optimal
form, as long as select and remap operations are done by calling the
methods on existing RV's, rather than explicitly creating selection or
remap RV's.
The disadvantage to this approach is that doing anything to an existing
query may recreate the entire relational operator structure, as criteria or
remaps are pushed downward. OTOH, as I've implied before, even the most
complex application queries will tend to involve a relatively small number
of RV's -- maybe dozens at most -- so this is unlikely to have a significant
performance impact overall.
In order to support this structure, logical expressions will also need to
be "optimized" to Conjunctive Normal Form (CNF). CNF basically means an
AND() of OR()s. Probably the logical expression interface will have a
'conjuncts()' method, and AND.conjuncts() will return its children, while
other logical expression types will return a list containing
themselves. Actually, OR.conjuncts() could check whether any expression
appears in *all* of its children's conjuncts, and elevate those common
expressions to the level of conjuncts. This should probably be done in
OR()'s constructor, though. AND()'s constructor will create an AND()
containing all of its arguments' conjuncts.
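A minimal sketch of the 'conjuncts()' protocol described above (class names are illustrative; the OR common-conjunct lifting is left out for brevity):

```python
# Sketch: AND flattens its arguments' conjuncts at construction time,
# so conjuncts() always yields a flat CNF-style list.

class LogicalExpr:
    def conjuncts(self):
        # Default: any non-AND expression is a single conjunct.
        return [self]

class AND(LogicalExpr):
    def __init__(self, *args):
        # The constructor absorbs all of its arguments' conjuncts,
        # so AND(a, AND(b, c)) never nests.
        self.children = []
        for arg in args:
            self.children.extend(arg.conjuncts())
    def conjuncts(self):
        return list(self.children)

class Cmp(LogicalExpr):
    """Stand-in for a comparison-operator object."""
    def __init__(self, text):
        self.text = text

a, b, c = Cmp('x=1'), Cmp('y=2'), Cmp('z=3')
assert AND(a, AND(b, c)).conjuncts() == [a, b, c]   # nesting flattened
```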
Another feature that logical expressions (used in RV operators) will need
is knowledge of all of the columns they (and/or their children) apply
to. This knowledge is needed in order to break up conjunctions to move
criteria downward. If a conjunct applies to columns in more than one table
of a join, it has to stay in a select() above the join. Logical
expressions need to be remappable as well, so that when criteria are pushed
down past remaps they can be rewritten to point to the right columns.
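The column-knowledge test can be sketched like this. Here each conjunct is just an (expression text, set of columns) pair; in the real system the column sets would come from the expression objects themselves, and the names are all hypothetical.

```python
# Sketch: deciding, per conjunct, whether it can be pushed below a
# two-input join or must stay in a select() above it.

def push_conjuncts(conjuncts, left_cols, right_cols):
    """Split conjuncts among the left input, right input, and the join."""
    left, right, stay = [], [], []
    for expr, cols in conjuncts:
        if cols <= left_cols:
            left.append(expr)     # touches only left columns: push left
        elif cols <= right_cols:
            right.append(expr)    # touches only right columns: push right
        else:
            stay.append(expr)     # spans both inputs: stays above the join
    return left, right, stay

l, r, s = push_conjuncts(
    [('a=1', {'a'}), ('b=2', {'b'}), ('a=b', {'a', 'b'})],
    left_cols={'a'}, right_cols={'b'},
)
assert (l, r, s) == (['a=1'], ['b=2'], ['a=b'])
```

This is exactly why CNF matters: only after a criterion has been split into independent conjuncts can each piece be routed separately.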
(Note, by the way, that in this entire message, "logical expressions"
should be considered to include comparison operator objects.)
I'm not sure where functions (on columns) should be applied, or how to fit
them into the overall logical structure. It seems they should be applied
as late as possible, to prevent calculation of the functions for rows that
are then discarded. However, they can't be moved above selections that
apply to them, or joins or aggregates that are based on them.
It's tempting to consider column functions part of remapping; it's a
natural fit from a logical perspective, especially since one normally wants
a function to create a new column, with a new name. But, they really want
to be as high in the tree as they can naturally go, which means we really
do need the IXXXConstructor scheme I mentioned earlier, and have __new__
methods adapt their parameters. We are likely to want to add new operator
types over time, and function operators will thus need to be adapted to
allow leapfrogging over those new operators where applicable. (That is,
the creator of a new operator type may need to define an adapter for the
function mapping type to a constructor protocol for the new operator type.)
For convenience's sake, the remap() and select() primitives should be part
of the IRemapConstructor and ISelectConstructor interfaces, and the
IRelationalOperator interface should derive from both, so all RV's can be
assumed to have them. This will make those constructors easier to implement.
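The interface relationship could be expressed roughly as follows, using the stdlib 'abc' module purely as a stand-in for PEAK's protocol machinery (the interface names match the ones above, but the mechanism shown is not PEAK's):

```python
# Sketch: IRelationalOperator derives from both constructor interfaces,
# so every RV can be assumed to support select() and remap().

from abc import ABC, abstractmethod

class ISelectConstructor(ABC):
    @abstractmethod
    def select(self, criteria): ...

class IRemapConstructor(ABC):
    @abstractmethod
    def remap(self, mapping): ...

class IRelationalOperator(ISelectConstructor, IRemapConstructor):
    """Every RV supports both primitives; constructors may call them
    on any input operator without type checks."""

class Table(IRelationalOperator):
    def select(self, criteria):
        return ('select', self, criteria)
    def remap(self, mapping):
        return ('remap', self, mapping)

t = Table()
assert t.select('f')[0] == 'select'
assert t.remap({'a': 'b'})[0] == 'remap'
```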
Okay, I think that's enough on operator optimization. I think the general
architecture seems pretty do-able, and it should work cleanly with the
filter-to-query translation algorithm outlined in "Part 3: The Final
Algorithm?"
Among the tasks remaining are:
* Make sure we know the full set of operators that need implementing
(logical, comparative, and relational)
* Verify how parameters are handled by each class of relational operator
* Firm up the mechanism for identifying columns/DVs in the comparative and
relational operators
* Sketch an SQL-generation architecture for adapting relational operators
to DB-specific SQL-generation interfaces, or perhaps DB-specific operator
implementations.
* Refine the SQL-generation architecture to a "compiled query generation"
architecture, so that e.g. joining tables from two databases can result in
a Python join of two separately-generated queries, or a query on one
database that is generated in part using data obtained by executing another
query.
* ???
* Profit!!!
P.S. For non-US readers, or at least non-watchers of South Park, the
???/Profit!!! above is a reference to a joke about the following
"fictional" business plan:
Step 1. Steal Underpants
Step 2. ???
Step 3. Profit!!!
Too bad that in real life, many business plans aren't actually much
different than the above... :)
More information about the PEAK mailing list