[PEAK] Why queries? Why records?
Phillip J. Eby
pje at telecommunity.com
Wed Feb 20 19:51:37 EST 2008
The more I look at SQLAlchemy, the more I want to just do the minimum
necessary to hook it up to the Trellis, and forget about anything
fancy. So, this post is an attempt to explore the reasons why I
think interactive Trellis applications need a record-oriented query
system -- and to see if my original assumptions about this still
apply. Or whether perhaps there's a simpler way to do things that
can avoid the complexity of implementing a full TRC (tuple relational
calculus) model.
Assume that some Trellis application is displaying a grid of dynamic
data -- a pretty common GUI application need. This data *might* be
from a database, but it might not be directly stored there. For
example, a calendar application might store event recurrence rules in
the database, but not every occurrence of the rule. So, you might
need to display some dynamically-generated items that aren't in a DB.
So far, no problem. As long as those items are in memory, you can
just use Trellis sets and a few rules to filter things down, no?
Well, maybe. If you use a simple looping rule to iterate over the
items, then it will have to be rerun any time even *one* item
changes, making display updates O(N) over the base set, not the
displayed subset. Worse, if you implement filters by composition,
where one set filters down to a subset, which filters down to a
further subset, each subsequent set may need to re-loop to
regenerate itself!
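To make that O(N) problem concrete, here's a minimal sketch, with no
Trellis wiring shown at all; recompute_visible() and the event
objects are just stand-ins for illustration:

    def recompute_visible(events, start, end):
        """Rebuild the displayed subset from scratch: O(len(events))."""
        return set(ev for ev in events if start <= ev.when < end)

    # Any change to any one event's .when re-fires the rule that calls
    # recompute_visible(), iterating over *all* events again -- and
    # each subset derived from the result may have to re-loop as well.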
So, my previous conclusion was that the "obvious" fix was to use sets
of immutable records, so that everything could be represented in
terms of adding or removing records from sets. Since a record's
values can't change, subsets only need to respond to add or delete
events from their parent sets, in order to keep up with
changes. Trellis components could then be watched by rules that
added and deleted records from a master virtual set, from which all
other subsets would derive.
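For example, something like this rough sketch (the record shape and
the Subset class are invented for illustration; they aren't actual
Trellis API):

    from collections import namedtuple

    # One rule per object regenerates its (immutable) record on change.
    EventRec = namedtuple('EventRec', 'uid title when')

    def record_for(obj):
        return EventRec(obj.uid, obj.title, obj.when)

    class Subset(object):
        """A derived set that keeps up by watching add/delete deltas."""
        def __init__(self, predicate):
            self.predicate = predicate
            self.members = set()

        def on_delta(self, added, deleted):
            # Records are immutable, so membership never needs to be
            # re-checked for records that weren't part of the delta.
            self.members -= set(deleted)
            self.members |= set(r for r in added if self.predicate(r))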
This approach has the advantage of using only a very small amount of
memory and very few cells. Each subset needs only one rule cell to
monitor its parent set, and each object needs only one rule cell to
keep its record-based representation up-to-date.
In contrast, an approach that monitors components directly, instead
of using immutable records, must either have at least one rule cell
*per object per set* (if membership conditions are monitored
independently), or else be O(M) for any object change (where M is the
number of membership conditions to re-check).
That is, you could have one rule evaluate all possible set membership
conditions, which would require only one rule cell per object and a
minimum of links. But any change to the object would require
re-evaluating all the membership conditions.
Hm. This isn't actually that different from the record-based
scenario, however, since any change to the object results in an old
record being deleted and a new one added. This causes changes to
ripple down through the sets and re-evaluate all the conditions anyway!
Which means that, as far as O() calculations are concerned in both
memory and speed, having a rule that evaluates set memberships on any
object change is no different from having a rule that (re)generates
records on any object change... provided that the membership
evaluation is smart enough to prune the subset tree.
That is, in a record-oriented scenario, subsets of a set that doesn't
contain any records for a given object do not have their conditions
re-evaluated when the object changes. That's because their parent
set neither adds nor deletes any records -- i.e., it doesn't care
about the change. So the subset doesn't need to do any membership testing.
To do the same thing in the pure object scenario, we would need to
somehow "tree-ify" the membership calculation, so that the same kinds
of cutoffs would apply. We can stop traversing the tree wherever a
membership test fails and the object wasn't previously a member
of the set. (If it was a member, then a deletion pass has to be made
to remove the object from any sets in that subtree where it was
previously a member, but no further membership tests need be done.)
A simple way to implement this would be to have a rule that walks any
"top-level" sets or splitters, calling a ._check_membership(ob)
method. This would check the object for membership, and recursively
propagate to listening subsets or splitters. So the rule would
dynamically depend only on the attributes used by the sets that the
object is currently a member of.
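In rough outline, that might look something like the following sketch
(only the _check_membership() name comes from the description above;
the Trellis wiring that turns this into a dynamically-dependent rule
is omitted):

    class FilteredSet(object):
        """One node in the subset tree."""
        def __init__(self, condition, parent=None):
            self.condition = condition   # callable(ob) -> bool
            self.members = set()
            self.children = []           # subsets listening to this set
            if parent is not None:
                parent.children.append(self)

        def _check_membership(self, ob):
            was_member = ob in self.members
            if self.condition(ob):
                self.members.add(ob)
                for child in self.children:
                    child._check_membership(ob)   # keep descending
            elif was_member:
                self._remove_recursively(ob)      # deletion pass only
            # else: ob wasn't and isn't a member, so the whole subtree
            # below this set can be skipped

        def _remove_recursively(self, ob):
            self.members.discard(ob)
            for child in self.children:
                if ob in child.members:
                    child._remove_recursively(ob)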
So, the net result is, we don't need to have a record-based system to
implement queries in a reasonable way. The O() performance in both
space and time is roughly the same for a non-record-based approach,
except that it doesn't have to construct a bunch of record tuples.
Well, almost. The one advantage remaining for the record-based
approach is that you can membership-test a tuple that is being
removed, in order to see whether it should be included in the
'deleted' event for downstream sets. But in an object-based world,
you would have to explicitly keep track of set membership in order to
issue correct update events. That may add some complication to the
scheme, as well as an O(N*M) memory consumption total (where M is the
average number of sets each of the N objects belongs to).
OTOH, schemes that treat all sets as virtual (i.e., don't keep track
of the members) have to pay extra in time when the set needs to be
iterated over. So, keeping track of the membership may be a
reasonable time/space tradeoff.
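In terms of the sketch above, the per-set "members" collections are
exactly that O(N*M) bookkeeping; issuing correct 'deleted' events
when an object goes away entirely might then look like:

    def remove_object(top_sets, ob):
        # Knowing which sets ob currently belongs to lets us emit the
        # right deletion events without re-running any conditions.
        for s in top_sets:
            if ob in s.members:
                s._remove_recursively(ob)
        # A fully "virtual" scheme would instead have to re-test ob's
        # *old* values against each set's condition here, trading the
        # O(N*M) memory for extra time.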
Anyway, what this means in practical terms is that we don't need a
record layer in order to do dynamic updates based on queries. Which
is mighty nice, in that it's one less thing to build in the short run.
We still will probably want something resembling a query language,
though. Even though _check_membership() can execute arbitrary Python
code -- and don't get me wrong, that's a *good* thing -- there are
places where it would be useful to be able to specify a query
expression once and have it apply to both database-backed queries and
in-memory filtering. (Because you want to track in-memory changes to
set membership within a collection of objects retrieved from a DB.)
Fortunately, the "fact base" model I sketched back in '04 allows a
fair degree of decoupling between a query "language" and its actual
implementation. Given a modest fact-base implementation, we could
literally have queries that were English-language strings, as long as
somebody registers a corresponding implementation of each query,
containing the necessary Python+SQL!
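As a toy illustration of that decoupling (all the names here --
QUERIES, register(), the query "sentence" itself -- are made up, not
an actual fact-base API):

    QUERIES = {}

    def register(name, sql, predicate):
        """Pair a query 'sentence' with a SQL and an in-memory form."""
        QUERIES[name] = (sql, predicate)

    register(
        "events overlapping a date range",
        sql="SELECT * FROM events"
            " WHERE when_at >= :start AND when_at < :end",
        predicate=lambda ev, start, end: start <= ev.when < end,
    )

    # The fact base hands the SQL to the database backend, and uses
    # the predicate to keep an in-memory set of live objects current.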
So anywho, this is brilliantly good news, as it means I don't need to
design the full TRC, as interesting a project as that might have
been. I'm not all that fond of SQLAlchemy's query or filter
syntaxes, but if the underlying metamodel is sufficient to express
what's needed for in-memory queries, then we could add interpretation
to make it work for that. OTOH, it may be simpler to "compile"
Python syntax down to SQLAlchemy objects for SQL, and Python bytecode
for in-memory filters. PEAK-Rules already includes the bulk of the
needed library code for such a thing, and the result might be
generally useful as an SQLAlchemy extension.
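Just to give a flavor of the "write the filter once" idea (the table
and predicate below are hypothetical, and the exact select() spelling
varies a bit between SQLAlchemy versions):

    from sqlalchemy import MetaData, Table, Column, Integer, String

    metadata = MetaData()
    events = Table('events', metadata,
        Column('id', Integer, primary_key=True),
        Column('title', String(100)),
        Column('duration', Integer),
    )

    # One predicate, written against "anything with a .duration":
    long_event = lambda e: e.duration > 60

    sql_clause = long_event(events.c)         # a SQLAlchemy clause
    stmt = events.select().where(sql_clause)  # usable as a DB query

    class Event(object):                      # a plain in-memory object
        duration = 90

    in_memory = long_event(Event())           # plain True/False

Of course, it's exactly where this trick breaks down (and/or versus
&/|, for instance) that the bytecode-compilation approach would earn
its keep.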
Now, this particular post has all been about TRC queries applied to
records in-memory, and doesn't get into the question of integrating
with Chandler's EIM framework.
And my original assumption was that EIM would be an intermediate
record format used for persistence, sharing, querying, and SQL
generation. But now, EIM doesn't have to be as integral a part of
the system. In particular, non-persistent data can be queried
without needing to define a persistence or sharing schema for it.
There will still need to be a record-based schema definition system
that implements EIM and "compiles down" to SQLAlchemy schema, and
maybe mappings as well. I'm going to have to investigate
SQLAlchemy's mapping system and find out whether it could be abused
to do object-EIM mappings or EIM-EIM mappings, especially if wrapped
in some syntax sugar.
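Just to give the flavor, a schema definition that "compiles down"
might look something like this sketch (the field list and
compile_record_type() are invented for illustration; EIM's actual
record-type API isn't shown):

    from sqlalchemy import (MetaData, Table, Column,
                            Integer, String, DateTime)

    # A trivial "record schema": (field name, column type) pairs.
    EVENT_FIELDS = [('uid', Integer),
                    ('title', String(255)),
                    ('when_at', DateTime)]

    def compile_record_type(name, fields, metadata):
        return Table(name, metadata,
                     *[Column(fname, ftype) for fname, ftype in fields])

    metadata = MetaData()
    event_table = compile_record_type('event', EVENT_FIELDS, metadata)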
So, there's now one less library to implement, and greater clarity
about what remains. Updating the evolving plan, we get:
* Some way to hook trellis.Component objects to SQLAlchemy's ORM
without the magic of one disrupting the magic of the other :-)
* A collections framework with notifications and mutable contents
support, comprising a fact base service, splitters, and various sorts
of sets and utilities (for example, a SortedSet that works with
mutable objects, since the current one only handles adds/deletes).
* Some type of query language, either based on SQLAlchemy constructs
or compiling down to them, that can also be mapped to requests for
splitters and sets from the fact base
* Some type of record-oriented schema definition that's
EIM-compatible and compiles down to SQLAlchemy; bonus points if
SQLAlchemy-style mappers can be used to turn EIM into objects or vice versa.
This seems like a good improvement over the previous plan, as it's
more incremental in delivering value, thus lowering the risk of going
for long periods without getting solid feedback. Also, very little
in this updated plan runs the risk of turning into a big research
project, unlike the TRC model.
What this approach gives up on, for the most part, is trying to
handle complex sorts of queries in a built-in way, in favor of
allowing complex tasks to be hand-wedged in, if need be. At this
stage of the game, that seems like a more than reasonable trade-off.