[TransWarp] Towards a query theory, part 4: RV operator optimization
Phillip J. Eby
pje at telecommunity.com
Wed Oct 15 16:09:01 EDT 2003
I found an interesting Java library today called XXL. It's a query library
that includes, among other things, a relational algebra framework and
both logical and physical query optimizers! It creates relational
operator trees similar to what I described in part 3.
So far, I've mainly been studying its logical optimizer, which does things
rather similar to what I'd envisioned. For example, it tries to "push
down" criteria close to their respective tables, as well as projections and
renames.
In some respects, though, it seems far more complicated than necessary, and
I'm not sure yet why. Some of the complexity comes from them trying to do
higher-order functional programming in Java (which is much easier in
Python). But then, much of the rest of the complexity seems to stem from
the optimizer not being written in a functional style! All of the actual
optimizer rules are described in functional terms, but the implementations
modify relational operator objects *in place*. I suspect this may be
intended to increase performance by reducing the amount of Java garbage
collection required, but it's hard to say for sure.
Another weird thing about their optimization rules is that they reiterate
some aspects in multiple passes, but there's no reason that I can see why
that part of the algorithm shouldn't execute either once, or a potentially
infinite number of times. For them to do it two or three times -
hardcoded! - makes no sense to me.
Of some 21 optimization rules, 7 look safe and simple to implement as
part of the relational operator constructors. They are basically
optimizations that stack redundant operators into single operators, like
converting select(select(r1,f1),f2) into select(r1,f1&f2). Their current
optimizer repeatedly reruns these 7 rules after certain other operations in
order to clean up after itself, so building these into the constructors
should make it unnecessary to manually reiterate them.
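As a rough sketch of what "building these into the constructors" could look like in Python, a __new__ method can collapse stacked selections at construction time. (All names here are illustrative, not taken from XXL or PEAK, and the criteria are just placeholder tuples rather than real logical expressions.)

```python
# Hypothetical sketch: flattening select(select(r1, f1), f2) into
# select(r1, f1 & f2) at construction time, via __new__.

class RV:
    """Base class for relational-operator nodes."""

class Table(RV):
    def __init__(self, name):
        self.name = name

class Select(RV):
    def __new__(cls, rv, criteria):
        # If the input is already a Select, merge the criteria
        # instead of stacking a second node on top of it.
        if isinstance(rv, Select):
            return Select(rv.rv, ('AND', rv.criteria, criteria))
        self = object.__new__(cls)
        self.rv = rv
        self.criteria = criteria
        return self

t = Table('orders')
q = Select(Select(t, 'f1'), 'f2')
assert isinstance(q.rv, Table)           # only one Select node remains
assert q.criteria == ('AND', 'f1', 'f2')
```

Because the flattening happens in __new__, there is no separate optimizer pass to rerun; a redundant operator simply never comes into existence.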
The overall optimizer strategy appears to be:
1. Separate criteria out of theta joins, converting them to
select(join(),criteria)
2. Combine all natural_join(R, natural_join(S,T)) nodes to natural_join(R,S,T)
3. Push all selection criteria down as close to the "real" tables as possible
4. Pull all projections as high in the tree as possible, combining
redundant ones along the way
5. Push all rename operations as low in the tree as possible
6. Recombine any redundant operations (e.g. join(join(...)),
select(select(...)), etc.)
7. Push criteria back into theta joins, wherever there's a
select(theta_join(...),whatever), and recombine any adjacent theta_joins
8. Pull all criteria (that didn't just get blended into theta joins) as
high in the tree as possible. (Why????)
9. Repeat step 3, pushing all the criteria as low in the tree as
possible. (Wha???? We just pulled them back *up*!)
10. Repeat step 6, recombining redundant operations.
11. Repeat step 7, pushing criteria back into theta joins. (Why would we
now have any that couldn't have been pushed in there the last time?)
12. Push projections as low in the tree as possible, but don't push them
lower than selections. (That is, eliminate all columns we don't need, but
don't eliminate ones that are part of criteria.)
My guess at this point is that the weird steps 8-11 have something to do
with the way the optimizer modifies and/or replaces nodes in-place. Here's
what I think the overall intent is:
1. Selection is the first thing that happens to a table
2. Projection and renaming happens next
3. Joins and aggregations happen in whatever order is specified
4. Redundant adjacent operators are flattened
Restated this way, it seems to me that it should be possible to embed the
optimization entirely in the relational operators' __new__ methods. Thus,
an operation like select() would check to see if its input was a select(),
and if so, create a combined select. If not, it would check whether its
input was a join and, if possible, recreate the join so that the criteria
are pushed down to the individual tables. If a rename, it would recreate
the rename so that the renamed operator had the selection applied to
it. And so on.
Of course, rather than write lots of if-then tests, we could simply create
protocols like 'ISelectConstructor' to adapt the input operator to. Then,
we could implement that protocol differently for different operator types.
Indeed, it seems that we could likely make select(filter) and
project/rename (remap(), probably) simply be methods of RV objects. This
would let them "do the right thing" simply by delegating the method to
their child RV's where appropriate.
This approach is algorithmically very clean, is easily extended to new
types of RV operators, and has the advantage of being "always
optimized". That is, any given query is always in its logically optimal
form, as long as select and remap operations are done by calling the
methods on existing RV's, rather than explicitly creating selection or
remap RV's.
The disadvantage to this approach is that doing anything to an existing
query may recreate the entire relational operator structure, as criteria or
remaps are pushed downward. OTOH, as I've implied before, even the most
complex application queries will tend to involve a relatively small number
of RV's -- maybe dozens at most -- so this is unlikely to have a significant
performance impact overall.
In order to support this structure, logical expressions will also need to
be "optimized" to Conjunctive Normal Form (CNF). CNF basically means an
AND() of OR()s. Probably the logical expression interface will have a
'conjuncts()' method, and AND.conjuncts() will return its children, while
other logical expression types will return a list containing
themselves. Actually, OR.conjuncts() could check whether any expression
appears in *all* of its children's conjuncts, and elevate those common
expressions to the level of conjuncts. This should probably be done in
OR()'s constructor, though. AND()'s constructor will create an AND()
containing all of its arguments' conjuncts.
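A minimal sketch of the 'conjuncts()' protocol described above (class names are illustrative; the OR common-conjunct lifting is left out for brevity):

```python
# Sketch: AND flattens its arguments' conjuncts at construction time,
# so conjuncts() always yields a flat CNF-style list.

class LogicalExpr:
    def conjuncts(self):
        # Default: any non-AND expression is a single conjunct.
        return [self]

class AND(LogicalExpr):
    def __init__(self, *args):
        # The constructor absorbs all of its arguments' conjuncts,
        # so AND(a, AND(b, c)) never nests.
        self.children = []
        for arg in args:
            self.children.extend(arg.conjuncts())
    def conjuncts(self):
        return list(self.children)

class Cmp(LogicalExpr):
    """Stand-in for a comparison-operator object."""
    def __init__(self, text):
        self.text = text

a, b, c = Cmp('x=1'), Cmp('y=2'), Cmp('z=3')
assert AND(a, AND(b, c)).conjuncts() == [a, b, c]   # nesting flattened
```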
Another feature that logical expressions (used in RV operators) will need
is knowledge of all of the columns they (and/or their children) apply
to. This knowledge is needed in order to break up conjunctions to move
criteria downward. If a conjunct applies to columns in more than one table
of a join, it has to stay in a select() above the join. Logical
expressions need to be remappable as well, so that when criteria are pushed
down past remaps they can be rewritten to point to the right columns.
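The column-knowledge test can be sketched like this. Here each conjunct is just an (expression text, set of columns) pair; in the real system the column sets would come from the expression objects themselves, and the names are all hypothetical.

```python
# Sketch: deciding, per conjunct, whether it can be pushed below a
# two-input join or must stay in a select() above it.

def push_conjuncts(conjuncts, left_cols, right_cols):
    """Split conjuncts among the left input, right input, and the join."""
    left, right, stay = [], [], []
    for expr, cols in conjuncts:
        if cols <= left_cols:
            left.append(expr)     # touches only left columns: push left
        elif cols <= right_cols:
            right.append(expr)    # touches only right columns: push right
        else:
            stay.append(expr)     # spans both inputs: stays above the join
    return left, right, stay

l, r, s = push_conjuncts(
    [('a=1', {'a'}), ('b=2', {'b'}), ('a=b', {'a', 'b'})],
    left_cols={'a'}, right_cols={'b'},
)
assert (l, r, s) == (['a=1'], ['b=2'], ['a=b'])
```

This is exactly why CNF matters: only after a criterion has been split into independent conjuncts can each piece be routed separately.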
(Note, by the way, that in this entire message, "logical expressions"
should be considered to include comparison operator objects.)
I'm not sure where functions (on columns) should be applied, or how to fit
them into the overall logical structure. It seems they should be applied
as late as possible, to prevent calculation of the functions for rows that
are then discarded. However, they can't be moved above selections that
apply to them, or joins or aggregates that are based on them.
It's tempting to consider column functions part of remapping; it's a
natural fit from a logical perspective, especially since one normally wants
a function to create a new column, with a new name. But, they really want
to be as high in the tree as they can naturally go, which means we really
do need the IXXXConstructor scheme I mentioned earlier, and have __new__
methods adapt their parameters. We are likely to want to add new operator
types over time, and function operators will thus need to be adapted to
allow leapfrogging over those new operators where applicable. (That is,
the creator of a new operator type may need to define an adapter for the
function mapping type to a constructor protocol for the new operator type.)
For convenience's sake, the remap() and select() primitives should be part
of the IRemapConstructor and ISelectConstructor interfaces, and the
IRelationalOperator interface should derive from both, so all RV's can be
assumed to have them. This will make those constructors easier to implement.
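The interface relationship could be expressed roughly as follows, using the stdlib 'abc' module purely as a stand-in for PEAK's protocol machinery (the interface names match the ones above, but the mechanism shown is not PEAK's):

```python
# Sketch: IRelationalOperator derives from both constructor interfaces,
# so every RV can be assumed to support select() and remap().

from abc import ABC, abstractmethod

class ISelectConstructor(ABC):
    @abstractmethod
    def select(self, criteria): ...

class IRemapConstructor(ABC):
    @abstractmethod
    def remap(self, mapping): ...

class IRelationalOperator(ISelectConstructor, IRemapConstructor):
    """Every RV supports both primitives; constructors may call them
    on any input operator without type checks."""

class Table(IRelationalOperator):
    def select(self, criteria):
        return ('select', self, criteria)
    def remap(self, mapping):
        return ('remap', self, mapping)

t = Table()
assert t.select('f')[0] == 'select'
assert t.remap({'a': 'b'})[0] == 'remap'
```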
Okay, I think that's enough on operator optimization. I think the general
architecture seems pretty do-able, and it should work cleanly with the
filter-to-query translation algorithm outlined in "Part 3: The Final
Algorithm?"
Among the tasks remaining are:
* Make sure we know the full set of operators that need implementing
(logical, comparative, and relational)
* Verify how parameters are handled by each class of relational operator
* Firm up the mechanism for identifying columns/DVs in the comparative and
relational operators
* Sketch an SQL-generation architecture for adapting relational operators
to DB-specific SQL-generation interfaces, or perhaps DB-specific operator
implementations.
* Refine the SQL-generation architecture to a "compiled query generation"
architecture, so that e.g. joining tables from two databases can result in
a Python join of two separately-generated queries, or a query on one
database that is generated in part using data obtained by executing another
query.
* ???
* Profit!!!
P.S. For non-US readers, or at least non-watchers of South Park, the
???/Profit!!! above is a reference to a joke about the following
"fictional" business plan:
Step 1. Steal Underpants
Step 2. ???
Step 3. Profit!!!
Too bad that in real life, many business plans aren't actually much
different than the above... :)
More information about the PEAK mailing list