[PEAK] Why queries? Why records?
Phillip J. Eby
pje at telecommunity.com
Wed Feb 20 19:51:37 EST 2008
The more I look at SQLAlchemy, the more I want to just do the minimum
necessary to hook it up to the Trellis, and forget about anything
fancy. So, this post is an attempt to explore the reasons why I
think interactive Trellis applications need a record-oriented query
system -- and to see if my original assumptions about this still
apply. Or whether perhaps there's a simpler way to do things that
can avoid the complexity of implementing a full TRC (tuple relational
calculus) model.
Assume that some Trellis application is displaying a grid of dynamic
data -- a pretty common GUI application need. This data *might* be
from a database, but it might not be directly stored there. For
example, a calendar application might store event recurrence rules in
the database, but not every occurrence of the rule. So, you might
need to display some dynamically-generated items that aren't in a DB.
So far, no problem. As long as those items are in memory, you can
just use Trellis sets and a few rules to filter things down, no?
Well, maybe. If you use a simple looping rule to iterate over the
items, then it will have to be rerun any time even *one* item
changes, making display updates O(N) over the base set, not the
displayed subset. Worse, if you implement filters by composition,
where one set filters down to a subset, which filters down to a
further subset, each subsequent set may need to re-loop to
regenerate itself!
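To make that O(N) problem concrete, here's a minimal sketch, with no
Trellis wiring shown at all; recompute_visible() and the event
objects are just stand-ins for illustration:

    def recompute_visible(events, start, end):
        """Rebuild the displayed subset from scratch: O(len(events))."""
        return set(ev for ev in events if start <= ev.when < end)

    # Any change to any one event's .when re-fires the rule that calls
    # recompute_visible(), iterating over *all* events again -- and
    # each subset derived from the result may have to re-loop as well.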
So, my previous conclusion was that the "obvious" fix was to use sets
of immutable records, so that everything could be represented in
terms of adding or removing records from sets. Since a record's
values can't change, subsets only need to respond to add or delete
events from their parent sets, in order to keep up with
changes. Trellis components could then be watched by rules that
added and deleted records from a master virtual set, from which all
other subsets would derive.
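For example, something like this rough sketch (the record shape and
the Subset class are invented for illustration; they aren't actual
Trellis API):

    from collections import namedtuple

    # One rule per object regenerates its (immutable) record on change.
    EventRec = namedtuple('EventRec', 'uid title when')

    def record_for(obj):
        return EventRec(obj.uid, obj.title, obj.when)

    class Subset(object):
        """A derived set that keeps up by watching add/delete deltas."""
        def __init__(self, predicate):
            self.predicate = predicate
            self.members = set()

        def on_delta(self, added, deleted):
            # Records are immutable, so membership never needs to be
            # re-checked for records that weren't part of the delta.
            self.members -= set(deleted)
            self.members |= set(r for r in added if self.predicate(r))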
This approach has the advantage of using only a very small amount of
memory and very few cells. Each subset needs only one rule cell to
monitor its parent set, and each object needs only one rule cell to
keep its record-based representation up-to-date.
In contrast, an approach that monitors components directly, instead
of using immutable records, must either have at least one rule cell
*per object per set* (if membership conditions are monitored
independently), or else be O(M) for any object change (where M is the
number of membership conditions to re-check).
That is, you could have one rule evaluate all possible set membership
conditions, which would require only one rule cell per object and a
minimum of links. But any change to the object would require
re-evaluating all the membership conditions.
Hm. This isn't actually that different from the record-based
scenario, however, since any change to the object results in an old
record being deleted and a new one added. This causes changes to
ripple down through the sets and re-evaluate all the conditions anyway!
Which means that, as far as O() calculations are concerned in both
memory and speed, having a rule that evaluates set memberships on any
object change is no different from having a rule that (re)generates
records on any object change... provided that the membership
evaluation is smart enough to prune the subset tree.
That is, in a record-oriented scenario, subsets of a set that doesn't
contain any records for a given object do not have their conditions
re-evaluated when the object changes. That's because their parent
set neither adds nor deletes any records -- i.e., it doesn't care
about the change. So the subset doesn't need to do any membership testing.
To do the same thing in the pure object scenario, we would need to
somehow "tree-ify" the membership calculation, so that the same kinds
of cutoffs would apply. We can stop traversing the tree wherever a
membership test fails and the object wasn't previously a member
of the set. (If it was a member, then a deletion pass has to be made
to remove the object from any sets in that subtree where it was
previously a member, but no further membership tests need be done.)
A simple way to implement this would be to have a rule that walks any
"top-level" sets or splitters, calling a ._check_membership(ob)
method. This would check the object for membership, and recursively
propagate to listening subsets or splitters. So the rule would
dynamically depend only on the attributes used by the sets that the
object is currently a member of.
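In rough outline, that might look something like the following sketch
(only the _check_membership() name comes from the description above;
the Trellis wiring that turns this into a dynamically-dependent rule
is omitted):

    class FilteredSet(object):
        """One node in the subset tree."""
        def __init__(self, condition, parent=None):
            self.condition = condition   # callable(ob) -> bool
            self.members = set()
            self.children = []           # subsets listening to this set
            if parent is not None:
                parent.children.append(self)

        def _check_membership(self, ob):
            was_member = ob in self.members
            if self.condition(ob):
                self.members.add(ob)
                for child in self.children:
                    child._check_membership(ob)   # keep descending
            elif was_member:
                self._remove_recursively(ob)      # deletion pass only
            # else: ob wasn't and isn't a member, so the whole subtree
            # below this set can be skipped

        def _remove_recursively(self, ob):
            self.members.discard(ob)
            for child in self.children:
                if ob in child.members:
                    child._remove_recursively(ob)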
So, the net result is, we don't need to have a record-based system to
implement queries in a reasonable way. The O() performance in both
space and time is roughly the same for a non-record-based approach,
except that it doesn't have to construct a bunch of record tuples.
Well, almost. The one advantage remaining for the record-based
approach is that you can membership-test a tuple that is being
removed, in order to see whether it should be included in the
'deleted' event for downstream sets. But in an object-based world,
you would have to explicitly keep track of set membership in order to
issue correct update events. That may add some complication to the
scheme, as well as an O(N*M) memory consumption total (where M is the
average number of sets each of the N objects belongs to).
OTOH, schemes that treat all sets as virtual (i.e., don't keep track
of the members) have to pay extra in time when the set needs to be
iterated over. So, keeping track of the membership may be a
reasonable time/space tradeoff.
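In terms of the sketch above, the per-set "members" collections are
exactly that O(N*M) bookkeeping; issuing correct 'deleted' events
when an object goes away entirely might then look like:

    def remove_object(top_sets, ob):
        # Knowing which sets ob currently belongs to lets us emit the
        # right deletion events without re-running any conditions.
        for s in top_sets:
            if ob in s.members:
                s._remove_recursively(ob)
        # A fully "virtual" scheme would instead have to re-test ob's
        # *old* values against each set's condition here, trading the
        # O(N*M) memory for extra time.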
Anyway, what this means in practical terms is that we don't need a
record layer in order to do dynamic updates based on queries. Which
is mighty nice, in that it's one less thing to build in the short run.
We still will probably want something resembling a query language,
though. Even though _check_membership() can execute arbitrary Python
code -- and don't get me wrong, that's a *good* thing -- there are
places where it would be useful to be able to specify a query
expression once and have it apply to both database-backed queries and
in-memory filtering. (Because you want to track in-memory changes to
set membership within a collection of objects retrieved from a DB.)
Fortunately, the "fact base" model I sketched back in '04 allows a
fair degree of decoupling between a query "language" and its actual
implementation. Given a modest fact-base implementation, we could
literally have queries that were English-language strings, as long as
somebody registers a corresponding implementation of each query,
containing the necessary Python+SQL!
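As a toy illustration of that decoupling (all the names here --
QUERIES, register(), the query "sentence" itself -- are made up, not
an actual fact-base API):

    QUERIES = {}

    def register(name, sql, predicate):
        """Pair a query 'sentence' with a SQL and an in-memory form."""
        QUERIES[name] = (sql, predicate)

    register(
        "events overlapping a date range",
        sql="SELECT * FROM events"
            " WHERE when_at >= :start AND when_at < :end",
        predicate=lambda ev, start, end: start <= ev.when < end,
    )

    # The fact base hands the SQL to the database backend, and uses
    # the predicate to keep an in-memory set of live objects current.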
So anywho, this is brilliantly good news, as it means I don't need to
design the full TRC, as interesting a project as that might have
been. I'm not all that fond of SQLAlchemy's query or filter
syntaxes, but if the underlying metamodel is sufficient to express
what's needed for in-memory queries, then we could add interpretation
to make it work for that. OTOH, it may be simpler to "compile"
Python syntax down to SQLAlchemy objects for SQL, and Python bytecode
for in-memory filters. PEAK-Rules already includes the bulk of the
needed library code for such a thing, and the result might be
generally useful as an SQLAlchemy extension.
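Just to give a flavor of the "write the filter once" idea (the table
and predicate below are hypothetical, and the exact select() spelling
varies a bit between SQLAlchemy versions):

    from sqlalchemy import MetaData, Table, Column, Integer, String

    metadata = MetaData()
    events = Table('events', metadata,
        Column('id', Integer, primary_key=True),
        Column('title', String(100)),
        Column('duration', Integer),
    )

    # One predicate, written against "anything with a .duration":
    long_event = lambda e: e.duration > 60

    sql_clause = long_event(events.c)         # a SQLAlchemy clause
    stmt = events.select().where(sql_clause)  # usable as a DB query

    class Event(object):                      # a plain in-memory object
        duration = 90

    in_memory = long_event(Event())           # plain True/False

Of course, it's exactly where this trick breaks down (and/or versus
&/|, for instance) that the bytecode-compilation approach would earn
its keep.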
Now, this particular post has all been about TRC queries applied to
records in-memory, and doesn't get into the question of integrating
with Chandler's EIM framework.
And my original assumption was that EIM would be an intermediate
record format used for persistence, sharing, querying, and SQL
generation. But now, EIM doesn't have to be as integral a part of
the system. In particular, non-persistent data can be queried
without needing to define a persistence or sharing schema for it.
There will still need to be a record-based schema definition system
that implements EIM and "compiles down" to SQLAlchemy schema, and
maybe mappings as well. I'm going to have to investigate
SQLAlchemy's mapping system and find out whether it could be abused
to do object-EIM mappings or EIM-EIM mappings, especially if wrapped
in some syntax sugar.
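Just to give the flavor, a schema definition that "compiles down"
might look something like this sketch (the field list and
compile_record_type() are invented for illustration; EIM's actual
record-type API isn't shown):

    from sqlalchemy import (MetaData, Table, Column,
                            Integer, String, DateTime)

    # A trivial "record schema": (field name, column type) pairs.
    EVENT_FIELDS = [('uid', Integer),
                    ('title', String(255)),
                    ('when_at', DateTime)]

    def compile_record_type(name, fields, metadata):
        return Table(name, metadata,
                     *[Column(fname, ftype) for fname, ftype in fields])

    metadata = MetaData()
    event_table = compile_record_type('event', EVENT_FIELDS, metadata)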
So, there's now one less library to implement, and greater clarity
about what remains. Updating the evolving plan, we get:
* Some way to hook trellis.Component objects to SQLAlchemy's ORM
without the magic of one disrupting the magic of the other :-)
* A collections framework with notifications and mutable contents
support, comprising a fact base service, splitters, and various sorts
of sets and utilities (for example, a SortedSet that works with
mutable objects, since the current one only handles adds/deletes).
* Some type of query language, either based on SQLAlchemy constructs
or compiling down to them, that can also be mapped to requests for
splitters and sets from the fact base
* Some type of record-oriented schema definition that's
EIM-compatible and compiles down to SQLAlchemy; bonus points if
SQLAlchemy-style mappers can be used to turn EIM into objects or vice versa.
This seems like a good improvement over the previous plan, as it's
more incremental in delivering value, thus lowering the risk of going
for long periods without getting solid feedback. Also, very little
in this updated plan runs the risk of turning into a big research
project, unlike the TRC model.
What this approach gives up on, for the most part, is trying to
handle complex sorts of queries in a built-in way, in favor of
allowing complex tasks to be hand-wedged in, if need be. At this
stage of the game, that seems like a more than reasonable trade-off.