[TransWarp] Towards a query theory, part 2: building the trees

Mon Oct 13 09:22:06 EDT 2003

In order to better understand how query trees should build themselves up, I 
am going to introduce a new "simplified" query API.  Simplified, in that it 
offers *fewer* user-friendly features.  Thus, we can discuss it with more 
precision.  Later, we can re-introduce the "friendly" syntax on top of this 
base.

First, some working terminology:

Filter -- a logical expression consisting of logical operators 
(AND/OR/NOT), traversal operators (FEATURE/MAYBE/EXISTS), and comparison 
operators (IS/EQ/GT/LT/etc.).

Context -- a set whose members will be tested against a filter.  A context 
might be the entire contents of a database, all objects of a particular 
type, or a single instance/value.

Implicit operand -- the context that a filter such as 'EQ(42)' applies 
to.  IOW, the context that 'EQ(42)' is intended to filter, is its implicit 
operand.

Variable -- a name for a context.  Variables are used to construct 
"correlations", as in the earlier example.  Variables are constructed with 
'VAR(name,filter=None,show=False)', and may only be used on the right-hand 
side of a comparison or traversal operator.  The 'filter' parameter is 
optional, used to express conditions that apply to the variable.  The 
'show' keyword can be used to indicate that this variable is a value that 
should be retrieved.

Query -- the combination of a filter and a context, such that all implicit 
operands in the filter are made explicit.  For example, combining a context 
such as the table 'Employee' with a filter such as 
'FEATURE("livesIn",FEATURE("name",EQ("Chicago")))' should result in a query 
structure like the following:

    EQ( <context Employee/livesIn/name>, "Chicago" )

Where <context Employee/livesIn/name> is a context object created by 
applying traversal operators to the base Employee context.  Thus, a query 
is essentially a filter whose subexpressions contain no implicit operands 
and no traversal operators.  However, for practical reasons, the query 
object will probably be implemented as an object that keeps track of its 
"root" contexts, since this is useful for SQL generation.

Constructing Queries
--------------------

Filters are built up simply, via constructors similar to the ones existing 
in peak.model.queries.  There will probably be some minor optimizations to 
allow elimination of duplicate subqueries, flattening of AND(AND(...)), 
OR(OR(...)), NOT(NOT(...)), etc.  (This may depend on whether it can be 
done without messing up join context information.)

Once constructed, a filter will be usable for one of two purposes.  It can 
be applied to an individual object value to test if the object complies 
with the filter, or, it can be used as a template to construct a query, by 
applying it to a context.

Each node type will have a 'withContext()' method (or something similar), 
which will return a copy of that node as it would be in the given 
context.  Logical operator nodes will return the same operator over 
'withContext()' applied to their children.  Traversal operators will return 
their child expression's 'withContext()', but they will first traverse the 
context to obtain a subcontext to supply to the child 
expression.  Comparison operators will return a version of themselves with 
two explicit operands, the "left hand" operand being the supplied context.

Variables offer some interesting issues.  When a variable is encountered in 
an EQ() or IS() operator (or is implied to be so, by it being the 
subexpression of a traversal operator), it may be considered a "defining" 
occurrence of the variable.  All other occurrences of the same variable 
should be replaced with the context that the defining occurrence was 
in.  However, there is no guarantee that a "defining" occurrence will 
happen before a non-defining usage, so variables that are used prior to 
definition must be replaced with a special "forward-reference" context, 
that will proxy the defining occurrence, once found.  If no defining 
occurrence of the variable is found, the forward-reference context will 
issue an "undefined variable" error at execution time.

If a variable is marked with 'show', then its defining context will be 
similarly marked.  If a variable has a filter, then it applies its defining 
context to the filter, and returns the redefined filter in place of the 
defining context.

The result of all these transformations should be a reconstructed filter 
that is completely explicit, with all traversal operators removed from the 
logical structure.  As the structure is built up, each output expression 
node tracks its context.

Generating Abstract SQL
-----------------------

To generate SQL from a query, we walk down the top-level logical 
expression, examining each context as we go.  Whenever we encounter a 
non-leaf context (i.e. one that has children), we create a table alias for 
it in the FROM clause, unless the context is an "existential" context.  An 
"existential" context is one created via EXISTS() or NOT_EXISTS() traversal 
operators, or implicitly by a one-to-many feature traversal where no 
subcontext of the context is marked "show".

If the context is "existential", we don't create a FROM alias.  Instead, we 
write an EXISTS() or NOT EXISTS () subquery to the WHERE clause, as 
appropriate, and generate a new SQL template.  When we traverse to 
subexpressions, we will generate any new aliases in the nested query's FROM 
clause, while still keeping the table alias names unique over the top-level 
query (because the nested queries may refer to outer queries' table 
aliases, if variables are in use).

If the context is not existential, we create a table alias in the current 
(sub)query, and write the logical or comparison operator and its operands 
to the WHERE clause.

If the context is marked "show", an appropriate alias should be added to 
the SELECT clause.

When an individual (sub)query is finished, its WHERE clause should be ANDed 
with any join conditions needed to enact the traversals from the parent 
context of the (sub)query to the defined table aliases.  Alternatively, 
join conditions may be added to the FROM clause as we go along, if the 
database supports ANSI join syntax (e.g. INNER JOIN blah ON foo).

As you can see, all of these descriptions are still quite vague, and don't 
precisely specify many of the data structures needed.  But they're a good 
start for now, and should be able to be refined more in future.  The idea 
of a "context" in particular needs much refinement, as contexts will 
probably be very type- and/or database-specific, as they will need 
knowledge of things like multiplicity and type of their features, 
etc.  It's also likely that some of the responsibilities I've ascribed here 
to contexts will actually be handled by query nodes, in order to minimize 
the communication required across the different objects.

For example, if query nodes track the presence of child contexts and 
visible child contexts, this would allow contexts to be immutable and only 
keep pointers to their parent contexts.  This is useful because query nodes 
are constructed "up" the tree (assembling lower-level pieces), but contexts 
are created "down" the tree (lower-level contexts are a function of higher 
contexts).  So, if each only keeps links in the opposite direction, both 
structures can be essentially immutable, using purely functional algorithms 
and avoiding circular references that would slow garbage collection.  More 
importantly, it would be impossible to accidentally share nodes between 
queries in a damaging way.

In my next installment, I'll probably either start narrowing this stuff 
down to some interface drafts, or else take a look at aggregates and 
grouping.  Which may not be as hard as I thought... or may blow the current 
model out of the water.  Stay tuned for the exciting conclusion!  :)