[TransWarp] Towards a query theory, part 1: filters and correlation
Phillip J. Eby
pje at telecommunity.com
Sun Oct 12 13:32:30 EDT 2003
I've been reviewing the conceptual query literature at:
http://www.orm.net/queries.html
for ideas towards a complete query mini-language usable within
Python. Here are my results so far.
Up until now, I've been viewing my "selectors" as a kind of filter object
applied to some set of objects. I realize now this is an incomplete
model. In truth, my "selectors" really need to be "variables", so they can
be referenced in other parts of a query. For example, here's a conceptual
query from one of the papers (slightly edited):
Employee1
+ - lives in City1
+ - was born in Country1
+ - supervises Employee2
+ - lives in City1
+ - was born in Country2
+ - is not Country1
To represent this with my hypothetical "selectors" mechanism, it would have
to be something like:
aCity = City.where()
aCountry = Country.where()
coResidentForeignSupervisors = Employee.where(
livesIn = aCity,
bornIn = aCountry,
supervises = Employee.where(
livesIn = aCity,
bornIn = Country.where().isNot(aCountry)
)
)
Here it becomes clear that selectors have object identity, and need to be
referenceable, in order to construct correlations. Also, that selectors
should have "is/is not" comparison operators available, although the same
result could be achieved by comparing primary keys. For example, one could
say Country.where(country_id = NOT_EQUAL(aCountry.country_id)), but note
that *that* shows that selector attributes need to *also* be
variables! (Or, to look at it another way, even primitive types have
selectors, e.g. Int.where() or String.where()!)
In short, analysis of just one query is enough to show that there are some
major holes in my mental model of what exists in a query... even a
relatively "flat" one (e.g. no aggregate constructs).
Another concept that needs adding, is the idea of which variables are
actually populated in returned results. For simple object retrievals, this
may not be necessary in that the "outermost" selector represents the thing
you want to retrieve. For reporting queries, one may need to specify
arbitrary individual fields or at least multiple objects. In this case, it
seems that selectors need to be able to be told "you're a value I want to
retrieve", perhaps by a 'show()' method that returns the same
selector. Thus, if we had wanted to list each supervisor paired with the
supervised employee, we could've said:
coResidentForeignSupervisors = Employee.where(
livesIn = aCity,
bornIn = aCountry,
supervises = Employee.where(
livesIn = aCity,
bornIn = Country.where().isNot(aCountry)
).show()
).show()
To indicate that the supervised employee should be shown along with the
supervisor. Maybe show would also take a parameter indicating the "column
name" under which the result is returned, with sensible defaulting.
Maybe this whole syntax is wrong. Consider...
aSupervisor = Employee.var()
aSubordinate = Employee.var()
filter = (
aSupervisor.livesIn == aSubordinate.livesIn &
aSubordinate in aSupervisor.supervises &
aSubordinate.bornIn <> aSupervisor.bornIn
)
This representation is much closer to SQL, since one could presumably map
it pretty directly to:
SELECT sup.*
FROM Employee sup, Employee sub
WHERE sub.supervisor = sup.id
AND sup.livesIn = sub.livesIn
AND sup.bornIn <> sub.bornIn
OTOH, mapping from the first syntax to the second was reasonably
straightforward, and I am reasonably sure it's automatable. Returning to
our original example, let's annotate it with object identities:
aCity = City.where() # V01
aCountry = Country.where() # V02
supervisors = Employee.where(
livesIn = aCity,
bornIn = aCountry,
supervises = Employee.where(
livesIn = aCity,
bornIn = Country.where().isNot(aCountry) # V05
) # V04
) # V03
So, we have 5 selector variables, and the following things are true:
supervisors.livesIn is aCity
supervisors.bornIn is aCountry
supervisors.supervises.livesIn is aCity
supervisors.supervises.bornIn.isNot(aCountry)
We can also represent this as a graph:
+----bornIn---> aCountry <---+
| |
sup --livesIn--> aCity |
| ^ |
supervises | |
| | |
v | |
V04 --livesIn----+ |
| |
+ ---bornIn---> V05 --isNot--+
We can extract relationships by following paths from the start point, and
creating a mapping of paths to objects and operators
/livesIn -> is(aCity)
/bornIn -> is(aCountry)
/supervises/livesIn -> is(aCity)
/supervises/bornIn -> isNot(aCountry)
/supervises -> V04
So, flipping this, we get:
aCity : /livesIn, /supervises/livesIn
aCountry : /bornIn, /supervises/bornIn (isNot)
V04: /supervises
And those are the three conditions that we expressed in the more SQL-like
syntax. In other words, when more than one structure path (livesIn,
bornIn, etc.) points to a variable, there is an assertion that the values
reached by the paths are equal. Wherever a relational operator expresses a
graph edge, it's an assertion of that relationship between the referenced
objects.
Or to simplify it further... two paths pointing to the same variable
should be expanded to have an "equals" edge. Then, you can simply extract
all comparative edges from the graph, to obtain the entire list of logical
conditions that apply. That seems like a relatively clean and
implementable algorithm, with appropriate design of the "variable" objects.
To deal with "or" and "not" operators as well as "and", we need to consider
the logical structure as a tree of and/or/not interior nodes, with leaves
that are the comparative operators in the first graph. I don't have the
slightest clue right now as to how to express that tree in the query
syntax, let alone how to keep a hold of it through the transformations to SQL.
On the bright side, I believe I now have a solid mental model for
expressing all non-aggregate relational query concepts. Effectively, the
structure is a logic tree (and/or/not/etc.) whose leaves are "attribute
value assertions" (a path or value compared to another path or value), with
a list of paths whose values should be retrieved. Transformation to
equivalent SQL flattens paths to joins, for the most part.
If no values are retrieved on the "far side" of a join, and the
relationship is one-to-many, that portion of the path must be reduced to an
EXISTS subquery, so that the same object on the "near side" isn't retrieved
multiple times. Such subqueries can then use joins freely, since the
number of items they retrieved is irrelevant.
In SQL, paths are only single steps (table.field), so the query must
introduce table aliases for each interior node of the path tree, beginning
from the root. Relational operators that refer to table aliases get
repointed to the primary key of the table alias instead.
This basically ignores O-R mapping issues like column renames, multiple
tables per class, etc. However, I believe those can all be expressed as
transformations on the path tree, such as renaming edges, inserting extra
edges to reach other tables, and perhaps choosing alternate tables as path
starts in order to avoid extra edges. The logic tree and relational
operators remain unchanged. Data type conversions only occur on constant
values in relational expressions.
Of course, knowing all this, and being able to write code to *do* it, are
two different things, but at least this is a starting point. In subsequent
analyses, I'll be looking at how to incorporate more complex issues like
aggregates, grouping, and outer joins. Plus, I'm still not 100% happy with
the syntax and spelling of the API, and it may be further influenced by
introducing the more complex issues.
Indeed, at the moment I don't even know what object holds the logic tree
for the relational expressions, although I suspect that each node in the
path tree holds a logic tree for all relational expressions contained
entirely within that path tree, or which were expressed directly upon that
node. For example, in 'Country.where().isNot(aCountry)', the resulting
node holds the relational expression ('isNot') between the anonymous
'Country.where()' and 'aCountry'. This expression is then "and"-ed by the
next highest containing node.
Interestingly, if we treat all of the nodes as components, and they bind to
their "parent" (containing node), and whenever we assign a child node that
already has a parent, we replace it with a new child node and create an
equality (join) assertion between the new node and the old node. This
would let us use the existing binding framework to extract traversal paths,
and simplify extraction of joins. When a child node is attached, it is
AND-ed with the parent's overall logical expression.
Pure-logic nodes (e.g. AND() and OR() objects) wouldn't attach subnodes to
themselves. Instead, they'd wait until they were placed in another
expression, and tell all their child nodes to attach to that
expression. Thus, object/value nodes would always have a component path
representing their data traversal path.
I'm not sure how to handle a NOT() node, although it would do the same
parent/child manipulation to ensure correct paths. Hmm. Actually, I just
realized that what's bothering me about NOT() is that we don't have a way
to negate relationships, only conditions. To put it another way, what's
the difference between "X has a Y that does not meet condition Z", and "X
does not have a Y that meets condition Z". One is a join with a negated
condition, and the other is a NOT EXISTS subquery. We don't have a way to
express the, latter unless we create a NOT_EXISTS() node type to wrap a
selector in. Such nodes would need to be directly converted to NOT-EXISTS
subqueries, and retrieving any values from that branch of the path tree
would be forbidden. (Although a node that's shared with another subtree
could still be retrieved.)
Having a NOT_EXISTS() seems rather like a kludge, though. OTOH, we will
probably also want a MAYBE() operator ala ConQuer, to express outer
joins. Thus, having such interior nodes to represent these specialized
join types maybe isn't such a bad thing. I think they would work like
regular nodes, not logical nodes. IOW, they would be part of their
childrens' paths, like /x/NOT_EXISTS/y or /x/MAYBE/y.
Well, I think that's enough for one ramble session. We now have a rough
design for logical nodes (AND, OR, NOT), join nodes (NOT_EXISTS, MAYBE),
selector nodes (type.where()), and relational operators (IS, IS_NOT, GT,
LT, EQ, etc.). We can now also more clearly restate our original example:
aCity = City.where()
aCountry = Country.where()
coResidentForeignSupervisors = Employee.where(
livesIn = aCity,
bornIn = aCountry,
supervises = Employee.where(
livesIn = aCity,
bornIn = IS_NOT(aCountry)
)
)
And indeed, I believe it's now possible for us to express in Python any
ConQuer query that does not involve aggregates or ternary
relationships. There are many details still to be fleshed out, mostly
well-formedness conditions such as "join nodes may not be root nodes",
"selector children of a root logical node must share the same type", and so
on. The latter is so you can't say something like AND(Employee.where(...),
Dog.where(...)). Hmm. I suppose that could make sense, if you wanted to
find all employees that were also dogs... Maybe we should just say that
all selector children of a root logical node are given an IS()
relationship, and treated as separate roots for table alias generation
purposes.
Anyway, that kind of stuff can be fleshed out a bit more when we get closer
to writing code, along with other issues like the best way to express that
a value should be retrieved, and what name if any it should have in the
retrieved result.
More information about the PEAK
mailing list