[PEAK] Minor design/plan changes in peak.query.algebra

Wed Oct 22 19:11:29 EDT 2003

Some changes are coming to the planned optimization algorithms for the 
relational algebra in peak.query.

Previously, I'd said that we want to push select() and project()/remap() 
operations as low in the tree as possible, in order to optimize the 
queries.  After experimenting today with the project() operation and table 
columns, while contemplating SQL generation, it seems to me that this type 
of pushdown is actually a premature optimization where SQL is 
concerned.  The peak.query.algebra.BasicJoin class I've been working on is 
actually a pretty decent model for a single SELECT statement, if the 
select() and project() data is kept at that level.  Pushing it down to 
individual tables would just mean an SQL generator would have to bring 
those things back *up* to the SELECT statement level.
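
To make the idea concrete, here's a minimal sketch of a join object that 
keeps its select() and project() data at the join level and renders as a 
single SELECT statement.  Everything in it is hypothetical: JoinSketch and 
its methods are names invented for illustration, conditions are plain SQL 
strings, and none of this is the actual peak.query.algebra.BasicJoin API.

```python
# Hypothetical illustration only; not the peak.query.algebra API.

class JoinSketch:
    """A join that holds its own criteria and output columns."""

    def __init__(self, tables, conditions=(), columns=()):
        self.tables = dict(tables)          # alias -> table name
        self.conditions = list(conditions)  # SQL condition strings
        self.columns = list(columns)        # output column expressions

    def select(self, condition):
        # Criteria accumulate at the join level; nothing is pushed down.
        return JoinSketch(self.tables, self.conditions + [condition],
                          self.columns)

    def project(self, *columns):
        # The projection is also kept here, replacing any earlier one.
        return JoinSketch(self.tables, self.conditions, columns)

    def to_sql(self):
        # Each part of the join maps directly onto one SELECT clause.
        cols = ", ".join(self.columns) or "*"
        tables = ", ".join(
            "%s AS %s" % (name, alias)
            for alias, name in self.tables.items()
        )
        sql = "SELECT %s FROM %s" % (cols, tables)
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        return sql


query = (
    JoinSketch({"e": "employee", "d": "department"})
    .select("e.dept_id = d.id")
    .select("d.name = 'R&D'")
    .project("e.name", "d.name")
)
sql_text = query.to_sql()
```

Because the criteria and column list already live on the join, the 
generator never has to collect anything from the individual tables.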

For cross-database joins, it should be straightforward for the join 
evaluator to pull out the project/select criteria for each underlying table 
and pass it down, so there seems to be little point in forcing them down 
ahead of time.  While the query framework is intended to work with other 
query languages, I think it's reasonable to expect that most of the time 
its target is going to be SQL, so it seems okay to leave this optimization 
as an extra step for non-SQL and cross-DB processors.
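
As a rough sketch of that extra step, a cross-DB evaluator could split the 
join's criteria into per-table groups and a join-level remainder.  This 
assumes each condition records which table aliases it refers to; the 
(aliases, text) representation and the split_criteria name are made up for 
the example and are not part of peak.query.

```python
# Hypothetical illustration only; not the peak.query.algebra API.

def split_criteria(conditions):
    """Separate single-table criteria from cross-table ones.

    `conditions` is an iterable of (aliases, text) pairs, where `aliases`
    is the set of table aliases the condition mentions.
    """
    per_table = {}    # alias -> criteria that can be passed down
    join_level = []   # criteria the join evaluator must apply itself
    for aliases, text in conditions:
        if len(aliases) == 1:
            (alias,) = aliases
            per_table.setdefault(alias, []).append(text)
        else:
            join_level.append(text)
    return per_table, join_level


conditions = [
    (frozenset({"e", "d"}), "e.dept_id = d.id"),
    (frozenset({"d"}), "d.name = 'R&D'"),
    (frozenset({"e"}), "e.salary > 50000"),
]
per_table, join_level = split_criteria(conditions)
```

Here the two single-table conditions can go to whichever database holds 
"d" and "e", and only the condition that spans both stays with the join.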

Based on my observations so far, the current peak.query.algebra design 
should be capable of representing any SQL query that does not contain 
aggregation, functions, or outer joins, using only a single BasicJoin 
object as the top-level structure.  (Note: I said current *design*, not 
implementation.  Comparisons and subqueries aren't available yet, and 
BasicJoin doesn't have a projection capability built in yet 
either.)  Anyway, queries of this form should cover an awful lot of 
non-reporting needs.

Here's a rough outline of where I'll probably go from here in implementing 
this:

* Trivial SELECT cols FROM tables WHERE conditions
* Prototype SQL generation
* Functions
* Subqueries
* Variables
* Parameters
* Aggregates
* Translation from "conceptual" queries

For every item after "prototype SQL generation", I'll probably be updating 
the prototyped SQL generation to make sure it can still handle the newly 
added features.

At this point I'm assuming that the failsafe route is to ensure that I have 
all needed functionality before trying to develop the conceptual query 
module.  I'm not positive this is right, since it's not clear that this 
will give the relational algebra framework a clean API from the conceptual 
query POV.  Also, it's possible that the conceptual query framework might 
be able to provide information that would simplify the relational algebra 
formulations, information that isn't available from a pure algebra 
POV.  But so far the algebra framework is so small that I probably 
shouldn't worry too much about refactoring it later.