[TransWarp] Towards a query theory, part 1: filters and correlation

Sun Oct 12 13:32:30 EDT 2003

I've been reviewing the conceptual query literature at:

http://www.orm.net/queries.html

for ideas towards a complete query mini-language usable within 
Python.  Here are my results so far.

Up until now, I've been viewing my "selectors" as a kind of filter object 
applied to some set of objects.  I realize now this is an incomplete 
model.  In truth, my "selectors" really need to be "variables", so they can 
be referenced in other parts of a query.  For example, here's a conceptual 
query from one of the papers (slightly edited):

Employee1
     + - lives in City1
     + - was born in Country1
     + - supervises Employee2
                       + - lives in City1
                       + - was born in Country2
                                          + - is not Country1

To represent this with my hypothetical "selectors" mechanism, it would have 
to be something like:

aCity = City.where()
aCountry = Country.where()

coResidentForeignSupervisors = Employee.where(
     livesIn = aCity,
     bornIn  = aCountry,
     supervises = Employee.where(
         livesIn = aCity,
         bornIn  = Country.where().isNot(aCountry)
     )
)

Here it becomes clear that selectors have object identity, and need to be 
referenceable, in order to construct correlations.  Also, that selectors 
should have "is/is not" comparison operators available, although the same 
result could be achieved by comparing primary keys.  For example, one could 
say Country.where(country_id = NOT_EQUAL(aCountry.country_id)), but note 
that *that* shows that selector attributes need to *also* be 
variables!  (Or, to look at it another way, even primitive types have 
selectors, e.g. Int.where() or String.where()!)

In short, analysis of just one query is enough to show that there are some 
major holes in my mental model of what exists in a query...  even a 
relatively "flat" one (e.g. no aggregate constructs).

Another concept that needs adding, is the idea of which variables are 
actually populated in returned results.  For simple object retrievals, this 
may not be necessary in that the "outermost" selector represents the thing 
you want to retrieve.  For reporting queries, one may need to specify 
arbitrary individual fields or at least multiple objects.  In this case, it 
seems that selectors need to be able to be told "you're a value I want to 
retrieve", perhaps by a 'show()' method that returns the same 
selector.  Thus, if we had wanted to list each supervisor paired with the 
supervised employee, we could've said:

coResidentForeignSupervisors = Employee.where(
     livesIn = aCity,
     bornIn  = aCountry,
     supervises = Employee.where(
         livesIn = aCity,
         bornIn  = Country.where().isNot(aCountry)
     ).show()
).show()

To indicate that the supervised employee should be shown along with the 
supervisor.  Maybe show would also take a parameter indicating the "column 
name" under which the result is returned, with sensible defaulting.

Maybe this whole syntax is wrong.  Consider...

aSupervisor = Employee.var()
aSubordinate = Employee.var()

filter = (
     aSupervisor.livesIn == aSubordinate.livesIn &
     aSubordinate in aSupervisor.supervises &
     aSubordinate.bornIn <> aSupervisor.bornIn
)

This representation is much closer to SQL, since one could presumably map 
it pretty directly to:

SELECT sup.*
   FROM Employee sup, Employee sub
  WHERE sub.supervisor = sup.id
    AND sup.livesIn = sub.livesIn
    AND sup.bornIn <> sub.bornIn

OTOH, mapping from the first syntax to the second was reasonably 
straightforward, and I am reasonably sure it's automatable.  Returning to 
our original example, let's annotate it with object identities:

aCity = City.where()         # V01
aCountry = Country.where()   # V02

supervisors = Employee.where(
     livesIn = aCity,
     bornIn  = aCountry,
     supervises = Employee.where(
         livesIn = aCity,
         bornIn  = Country.where().isNot(aCountry) # V05
     ) # V04
)  # V03

So, we have 5 selector variables, and the following things are true:

supervisors.livesIn is aCity
supervisors.bornIn is aCountry
supervisors.supervises.livesIn is aCity
supervisors.supervises.bornIn.isNot(aCountry)

We can also represent this as a graph:

  +----bornIn---> aCountry <---+
  |                            |
sup --livesIn--> aCity        |
  |               ^            |
supervises       |            |
  |               |            |
  v               |            |
V04 --livesIn----+            |
  |                            |
  + ---bornIn---> V05 --isNot--+

We can extract relationships by following paths from the start point, and 
creating a mapping of paths to objects and operators

/livesIn -> is(aCity)
/bornIn -> is(aCountry)
/supervises/livesIn -> is(aCity)
/supervises/bornIn -> isNot(aCountry)
/supervises -> V04

So, flipping this, we get:

aCity : /livesIn, /supervises/livesIn
aCountry : /bornIn, /supervises/bornIn (isNot)
V04: /supervises

And those are the three conditions that we expressed in the more SQL-like 
syntax.  In other words, when more than one structure path (livesIn, 
bornIn, etc.) points to a variable, there is an assertion that the values 
reached by the paths are equal.  Wherever a relational operator expresses a 
graph edge, it's an assertion of that relationship between the referenced 
objects.

Or to simplify it further...  two paths pointing to the same variable 
should be expanded to have an "equals" edge.  Then, you can simply extract 
all comparative edges from the graph, to obtain the entire list of logical 
conditions that apply.  That seems like a relatively clean and 
implementable algorithm, with appropriate design of the "variable" objects.

To deal with "or" and "not" operators as well as "and", we need to consider 
the logical structure as a tree of and/or/not interior nodes, with leaves 
that are the comparative operators in the first graph.  I don't have the 
slightest clue right now as to how to express that tree in the query 
syntax, let alone how to keep a hold of it through the transformations to SQL.

On the bright side, I believe I now have a solid mental model for 
expressing all non-aggregate relational query concepts.  Effectively, the 
structure is a logic tree (and/or/not/etc.) whose leaves are "attribute 
value assertions" (a path or value compared to another path or value), with 
a list of paths whose values should be retrieved.  Transformation to 
equivalent SQL flattens paths to joins, for the most part.

If no values are retrieved on the "far side" of a join, and the 
relationship is one-to-many, that portion of the path must be reduced to an 
EXISTS subquery, so that the same object on the "near side" isn't retrieved 
multiple times.  Such subqueries can then use joins freely, since the 
number of items they retrieved is irrelevant.

In SQL, paths are only single steps (table.field), so the query must 
introduce table aliases for each interior node of the path tree, beginning 
from the root.  Relational operators that refer to table aliases get 
repointed to the primary key of the table alias instead.

This basically ignores O-R mapping issues like column renames, multiple 
tables per class, etc.  However, I believe those can all be expressed as 
transformations on the path tree, such as renaming edges, inserting extra 
edges to reach other tables, and perhaps choosing alternate tables as path 
starts in order to avoid extra edges.  The logic tree and relational 
operators remain unchanged.  Data type conversions only occur on constant 
values in relational expressions.

Of course, knowing all this, and being able to write code to *do* it, are 
two different things, but at least this is a starting point.  In subsequent 
analyses, I'll be looking at how to incorporate more complex issues like 
aggregates, grouping, and outer joins.  Plus, I'm still not 100% happy with 
the syntax and spelling of the API, and it may be further influenced by 
introducing the more complex issues.

Indeed, at the moment I don't even know what object holds the logic tree 
for the relational expressions, although I suspect that each node in the 
path tree holds a logic tree for all relational expressions contained 
entirely within that path tree, or which were expressed directly upon that 
node.  For example, in 'Country.where().isNot(aCountry)', the resulting 
node holds the relational expression ('isNot') between the anonymous 
'Country.where()' and 'aCountry'.  This expression is then "and"-ed by the 
next highest containing node.

Interestingly, if we treat all of the nodes as components, and they bind to 
their "parent" (containing node), and whenever we assign a child node that 
already has a parent, we replace it with a new child node and create an 
equality (join) assertion between the new node and the old node.  This 
would let us use the existing binding framework to extract traversal paths, 
and simplify extraction of joins.  When a child node is attached, it is 
AND-ed with the parent's overall logical expression.

Pure-logic nodes (e.g. AND() and OR() objects) wouldn't attach subnodes to 
themselves.  Instead, they'd wait until they were placed in another 
expression, and tell all their child nodes to attach to that 
expression.  Thus, object/value nodes would always have a component path 
representing their data traversal path.

I'm not sure how to handle a NOT() node, although it would do the same 
parent/child manipulation to ensure correct paths.  Hmm.  Actually, I just 
realized that what's bothering me about NOT() is that we don't have a way 
to negate relationships, only conditions.  To put it another way, what's 
the difference between "X has a Y that does not meet condition Z", and "X 
does not have a Y that meets condition Z".  One is a join with a negated 
condition, and the other is a NOT EXISTS subquery.  We don't have a way to 
express the, latter unless we create a NOT_EXISTS() node type to wrap a 
selector in.  Such nodes would need to be directly converted to NOT-EXISTS 
subqueries, and retrieving any values from that branch of the path tree 
would be forbidden.  (Although a node that's shared with another subtree 
could still be retrieved.)

Having a NOT_EXISTS() seems rather like a kludge, though.  OTOH, we will 
probably also want a MAYBE() operator ala ConQuer, to express outer 
joins.  Thus, having such interior nodes to represent these specialized 
join types maybe isn't such a bad thing.  I think they would work like 
regular nodes, not logical nodes.  IOW, they would be part of their 
childrens' paths, like /x/NOT_EXISTS/y or /x/MAYBE/y.

Well, I think that's enough for one ramble session.  We now have a rough 
design for logical nodes (AND, OR, NOT), join nodes (NOT_EXISTS, MAYBE), 
selector nodes (type.where()), and relational operators (IS, IS_NOT, GT, 
LT, EQ, etc.).  We can now also more clearly restate our original example:

aCity = City.where()
aCountry = Country.where()

coResidentForeignSupervisors = Employee.where(
     livesIn = aCity,
     bornIn  = aCountry,
     supervises = Employee.where(
         livesIn = aCity,
         bornIn  = IS_NOT(aCountry)
     )
)

And indeed, I believe it's now possible for us to express in Python any 
ConQuer query that does not involve aggregates or ternary 
relationships.  There are many details still to be fleshed out, mostly 
well-formedness conditions such as "join nodes may not be root nodes", 
"selector children of a root logical node must share the same type", and so 
on.  The latter is so you can't say something like AND(Employee.where(...), 
Dog.where(...)).  Hmm.  I suppose that could make sense, if you wanted to 
find all employees that were also dogs...  Maybe we should just say that 
all selector children of a root logical node are given an IS() 
relationship, and treated as separate roots for table alias generation 
purposes.

Anyway, that kind of stuff can be fleshed out a bit more when we get closer 
to writing code, along with other issues like the best way to express that 
a value should be retrieved, and what name if any it should have in the 
retrieved result.