[TransWarp] Towards a query theory, part 2: building the trees
Phillip J. Eby
pje at telecommunity.com
Mon Oct 13 09:22:06 EDT 2003
In order to better understand how query trees should build themselves up, I
am going to introduce a new "simplified" query API. Simplified, in that it
offers *fewer* user-friendly features. Thus, we can discuss it with more
precision. Later, we can re-introduce the "friendly" syntax on top of this
simplified API.
First, some working terminology:
Filter -- a logical expression consisting of logical operators
(AND/OR/NOT), traversal operators (FEATURE/MAYBE/EXISTS), and comparison
operators (such as EQ and IS).
Context -- a set whose members will be tested against a filter. A context
might be the entire contents of a database, all objects of a particular
type, or a single instance/value.
Implicit operand -- the context that a filter such as 'EQ(42)' applies
to. IOW, the context that 'EQ(42)' is intended to filter is its implicit
operand.
Variable -- a name for a context. Variables are used to construct
"correlations", as in the earlier example. Variables are constructed with
'VAR(name,filter=None,show=False)', and may only be used on the right-hand
side of a comparison or traversal operator. The 'filter' parameter is
optional, used to express conditions that apply to the variable. The
'show' keyword can be used to indicate that this variable is a value that
should be retrieved.
Query -- the combination of a filter and a context, such that all implicit
operands in the filter are made explicit. For example, combining a context
such as the table 'Employee' with a filter such as
'FEATURE("livesIn",FEATURE("name",EQ("Chicago")))' should result in a query
structure like the following:
EQ( <context Employee/livesIn/name>, "Chicago" )
Where <context Employee/livesIn/name> is a context object created by
applying traversal operators to the base Employee context. Thus, a query
is essentially a filter whose subexpressions contain no implicit operands
and no traversal operators. However, for practical reasons, the query
object will probably be implemented as an object that keeps track of its
"root" contexts, since this is useful for SQL generation.
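To make the terminology concrete, here's a rough sketch of the filter constructors as plain data objects. The class names mirror the text, but this is purely illustrative and is not the actual peak.model.queries implementation:

```python
class EQ:
    def __init__(self, value):
        self.value = value          # comparison value; the operand is implicit
    def __repr__(self):
        return "EQ(%r)" % (self.value,)

class FEATURE:
    def __init__(self, name, filter):
        self.name = name            # feature to traverse
        self.filter = filter        # filter applied to the traversed subcontext
    def __repr__(self):
        return "FEATURE(%r, %r)" % (self.name, self.filter)

# The example from the text: employees who live in Chicago
f = FEATURE("livesIn", FEATURE("name", EQ("Chicago")))
print(f)   # FEATURE('livesIn', FEATURE('name', EQ('Chicago')))
```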
Filters are built up simply, via constructors similar to the ones existing
in peak.model.queries. There will probably be some minor optimizations to
allow elimination of duplicate subqueries, flattening of AND(AND(...)),
OR(OR(...)), NOT(NOT(...)), etc. (This may depend on whether it can be
done without messing up join context information.)
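The AND(AND(...)) flattening mentioned above could be done right in the constructor; NOT(NOT(x)) elimination would work along the same lines. Again, a toy sketch, not a committed design:

```python
class AND:
    def __init__(self, *children):
        flat = []
        for c in children:
            # AND(AND(a, b), c) collapses to AND(a, b, c)
            flat.extend(c.children if isinstance(c, AND) else [c])
        self.children = tuple(flat)

inner = AND("a", "b")
outer = AND(inner, "c")
print(outer.children)   # ('a', 'b', 'c')
```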
Once constructed, a filter will be usable for one of two purposes. It can
be applied to an individual object value to test if the object complies
with the filter, or, it can be used as a template to construct a query, by
applying it to a context.
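One way the first purpose -- testing an individual object against a filter -- might look is for each node to grow a 'match()' method that traverses the object itself. The method and class names here are invented for illustration:

```python
class EQ:
    def __init__(self, value):
        self.value = value
    def match(self, obj):
        # the object being tested is the implicit operand
        return obj == self.value

class FEATURE:
    def __init__(self, name, filter):
        self.name, self.filter = name, filter
    def match(self, obj):
        # traverse the feature on the object, then test the result
        return self.filter.match(getattr(obj, self.name))

# Invented toy domain classes for the Employee/livesIn/name example
class City:
    def __init__(self, name): self.name = name
class Employee:
    def __init__(self, livesIn): self.livesIn = livesIn

f = FEATURE("livesIn", FEATURE("name", EQ("Chicago")))
print(f.match(Employee(City("Chicago"))))   # True
print(f.match(Employee(City("Boston"))))    # False
```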
Each node type will have a 'withContext()' method (or something similar),
which will return a copy of that node as it would be in the given
context. Logical operator nodes will return the same operator over
'withContext()' applied to their children. Traversal operators will return
their child expression's 'withContext()', but they will first traverse the
context to obtain a subcontext to supply to the child
expression. Comparison operators will return a version of themselves with
two explicit operands, the "left hand" operand being the supplied context.
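The three rules above -- logical operators recurse, traversal operators vanish into subcontexts, comparisons gain an explicit left-hand operand -- might be sketched like this. 'Context', 'traverse()', and the exact 'withContext()' signature are all stand-ins, since the real method may well differ:

```python
class Context:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
    def traverse(self, feature):
        # derive the subcontext reached via a feature
        return Context(feature, self)
    def path(self):
        return (self.parent.path() + "/" if self.parent else "") + self.name

class AND:
    def __init__(self, *children):
        self.children = children
    def withContext(self, ctx):
        # logical operators: same operator over recontextualized children
        return AND(*[c.withContext(ctx) for c in self.children])

class FEATURE:
    def __init__(self, name, filter):
        self.name, self.filter = name, filter
    def withContext(self, ctx):
        # traversal operators disappear, handing a subcontext to the child
        return self.filter.withContext(ctx.traverse(self.name))

class EQ:
    def __init__(self, value, context=None):
        self.value, self.context = value, context
    def withContext(self, ctx):
        # comparisons gain an explicit left-hand operand: the context
        return EQ(self.value, ctx)

q = FEATURE("livesIn", FEATURE("name", EQ("Chicago"))).withContext(Context("Employee"))
print(q.context.path(), q.value)   # Employee/livesIn/name Chicago
```

Note that the traversal operators are gone from the output: all that remains is a comparison with an explicit context operand, matching the query structure shown earlier.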
Variables raise some interesting issues. When a variable is encountered in
an EQ() or IS() operator (or is implied to be so, by it being the
subexpression of a traversal operator), it may be considered a "defining"
occurrence of the variable. All other occurrences of the same variable
should be replaced with the context that the defining occurrence was
in. However, there is no guarantee that a "defining" occurrence will
happen before a non-defining usage, so variables that are used prior to
definition must be replaced with a special "forward-reference" context,
that will proxy the defining occurrence, once found. If no defining
occurrence of the variable is found, the forward-reference context will
issue an "undefined variable" error at execution time.
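The forward-reference context might amount to little more than a mutable cell that the defining occurrence later fills in; something like this hypothetical sketch:

```python
class ForwardRef:
    def __init__(self, name):
        self.name = name
        self.target = None          # filled in by the defining occurrence
    def define(self, context):
        self.target = context
    def resolve(self):
        if self.target is None:
            # no defining occurrence was ever found
            raise NameError("undefined query variable: %r" % self.name)
        return self.target

ref = ForwardRef("emp")
try:
    ref.resolve()
except NameError as e:
    print(e)            # the "undefined variable" error at execution time

ref.define("<context Employee>")
print(ref.resolve())    # <context Employee>
```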
If a variable is marked with 'show', then its defining context will be
similarly marked. If a variable has a filter, then it applies its defining
context to the filter, and returns the redefined filter in place of the
variable.
The result of all these transformations should be a reconstructed filter
that is completely explicit, with all traversal operators removed from the
logical structure. As the structure is built up, each output expression
node tracks its context.
Generating Abstract SQL
To generate SQL from a query, we walk down the top-level logical
expression, examining each context as we go. Whenever we encounter a
non-leaf context (i.e. one that has children), we create a table alias for
it in the FROM clause, unless the context is an "existential" context. An
"existential" context is one created via EXISTS() or NOT_EXISTS() traversal
operators, or implicitly by a one-to-many feature traversal where no
subcontext of the context is marked "show".
If the context is "existential", we don't create a FROM alias. Instead, we
write an EXISTS() or NOT EXISTS() subquery to the WHERE clause, as
appropriate, and generate a new SQL template. When we traverse to
subexpressions, we will generate any new aliases in the nested query's FROM
clause, while still keeping the table alias names unique over the top-level
query (because the nested queries may refer to outer queries' table
aliases, if variables are in use).
If the context is not existential, we create a table alias in the current
(sub)query, and write the logical or comparison operator and its operands
to the WHERE clause.
If the context is marked "show", an appropriate alias should be added to
the SELECT clause.
When an individual (sub)query is finished, its WHERE clause should be ANDed
with any join conditions needed to enact the traversals from the parent
context of the (sub)query to the defined table aliases. Alternatively,
join conditions may be added to the FROM clause as we go along, if the
database supports ANSI join syntax (e.g. INNER JOIN blah ON foo).
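For the non-existential path, the bookkeeping boils down to: one FROM alias per non-leaf context, comparisons in WHERE, ANDed with the join conditions that enact the traversals. Here's a toy illustration hard-coded to the Employee/livesIn/name example; the table names, column names, and join condition are invented, not derived from real PEAK metadata:

```python
def sql_for_chicago_query():
    # each non-leaf context gets a FROM alias...
    from_clause = ["Employee t1", "City t2"]
    # ...the comparison goes in WHERE, ANDed with the join condition
    # that enacts the 'livesIn' traversal from t1 to t2
    where = ["t1.livesIn = t2.id", "t2.name = 'Chicago'"]
    return "SELECT t1.* FROM %s WHERE %s" % (
        ", ".join(from_clause), " AND ".join(where))

print(sql_for_chicago_query())
# SELECT t1.* FROM Employee t1, City t2
#   WHERE t1.livesIn = t2.id AND t2.name = 'Chicago'
```

With ANSI join syntax, the 't1.livesIn = t2.id' condition would instead move into the FROM clause as 'INNER JOIN City t2 ON t1.livesIn = t2.id'.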
As you can see, all of these descriptions are still quite vague, and don't
precisely specify many of the data structures needed. But they're a good
start for now, and can be refined further in future installments. The idea
of a "context" in particular needs much refinement, as contexts will
probably be very type- and/or database-specific, as they will need
knowledge of things like multiplicity and type of their features,
etc. It's also likely that some of the responsibilities I've ascribed here
to contexts will actually be handled by query nodes, in order to minimize
the communication required across the different objects.
For example, if query nodes track the presence of child contexts and
visible child contexts, this would allow contexts to be immutable and only
keep pointers to their parent contexts. This is useful because query nodes
are constructed "up" the tree (assembling lower-level pieces), but contexts
are created "down" the tree (lower-level contexts are a function of higher
contexts). So, if each only keeps links in the opposite direction, both
structures can be essentially immutable, using purely functional algorithms
and avoiding circular references that would slow garbage collection. More
importantly, it would be impossible to accidentally share nodes between
queries in a damaging way.
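The opposite-direction linking could be sketched as follows: contexts link only "up" to their parents, query nodes link only "down" to their children, and the nodes carry the child-context flags. A fresh toy sketch, with invented names:

```python
class Context:
    __slots__ = ("name", "parent")      # immutable-ish: no stray attributes
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent   # link "up" only

class Node:
    def __init__(self, context, children=()):
        self.context = context
        self.children = tuple(children)         # link "down" only
        # nodes, not contexts, track whether any child contexts exist
        self.has_children = bool(children)

root = Context("Employee")
leaf = Context("livesIn", root)
n = Node(leaf)
print(n.context.parent.name, n.has_children)   # Employee False
```

Since no object ever points both up and down, neither tree can contain a cycle, and subtrees can be shared freely between queries without risk of mutation.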
In my next installment, I'll probably either start narrowing this stuff
down to some interface drafts, or else take a look at aggregates and
grouping. Which may not be as hard as I thought... or may blow the current
model out of the water. Stay tuned for the exciting conclusion! :)