[PEAK] A hub-and-spoke mechanism for efficient many-to-many cells

Tue Jul 17 17:39:48 EDT 2007

After doing a lot of thinking about managing large mutable data 
structures in the Trellis (especially mutually-dependent ones like 
bidirectional relationships in an application database), I think I've 
come up with a way to efficiently manage them.

Normally, the Trellis prefers values to be derived by rules 
(functionally) rather than determined by setting values or modifying 
data structures (imperatively).  Imperative operations have an 
efficiency penalty because they have to be deferred to the next 
"pulse" of the Trellis, so that any rules that depend on *current* 
values still have a chance to see them before they change.

That's the first efficiency problem with large sets; the other is 
that if one set is computed as a subset of another large set, then 
any change to the large set results in an O(N) operation to compute 
the smaller set.  So, you really need a way to obtain "diffs".  You 
may also have *summarized* values, such as a sum of data from the 
changed set, or a rule that applies to only one particular member of 
the set.  In such cases, you'd like to only be called when something 
you care about changes, not when the entire set changes.

So, I've come up with a pair of basic mechanisms I call a Hub and 
Spoke, that can be used to address these issues.

A Hub is a cell that can have multiple values written to it, given a 
function that computes how its "future value" should change as each 
value is written to it.  When you read a Hub, it still returns its 
*current* value, but as soon as the Trellis ticks over to a new 
pulse, it will start returning its accumulated "future" value.

In this way, you can create mutable data structures by having mutator 
methods put callables into a Hub attribute, that accumulate these 
actions into a "roll-forward" log.  Your main data cell is then 
computed as a rule that invokes these actions before returning the data.

So Hub cells can work as accumulators for N imperative operations 
applied to a single mutable data structure - an N:1 relationship.
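
As a rough illustration of the Hub idea, here is a minimal sketch in 
plain Python.  It is not Trellis code: the Hub class, its write() and 
tick() methods, and the explicit tick() call standing in for the pulse 
boundary are all assumptions made up for this example.

```python
class Hub:
    """Toy accumulator cell: writes are folded into a future value."""

    def __init__(self, initial, combine):
        self.value = initial      # what readers see during this pulse
        self._future = initial    # what readers will see next pulse
        self._combine = combine   # (future, written) -> new future

    def write(self, item):
        # Writes never touch the current value, only the pending one.
        self._future = self._combine(self._future, item)

    def tick(self):
        # Stand-in for the scheduler moving to a new pulse.
        self.value = self._future


# A "roll-forward" log: mutators write callables into the hub.
log = Hub((), lambda future, action: future + (action,))
members = set()

log.write(lambda s: s.add("alice"))
log.write(lambda s: s.add("bob"))
seen_during_pulse = log.value     # still the empty log

log.tick()
for action in log.value:          # the data rule replays the log
    action(members)
```

Two writers contributed to one structure, and neither change was 
visible until the pulse ticked over.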

On the flip side, 1:N relationships can be handled by Spokes.  A 
Spoke is a cell whose value can only be set by one rule, which must 
be predeclared.  Thus, you can set its value imperatively *without 
requiring a new pulse*.  Specifically, when a Spoke is asked for its 
value, it can force the rule it depends on to be calculated 
first.  Thus, it doesn't need to wait for another moment of time to 
be sure that it's up-to-date.

Here's a rewrite of my hypothetical "Time" service using Spokes:

     import heapq
     import time
     import weakref

     class Time(trellis.Component, context.Service):

         trellis.rules(
             now        = lambda self:
                 volatile() and Timestamp(time.time()),
             _schedule  = lambda self: [NOT_YET],
             _events    = lambda self: weakref.WeakValueDictionary(),
         )

         @trellis.rule
         def next_event(self):
             # Note that because 'now' is volatile() and always
             # has a new value, we don't need to worry about
             # tracking changes to self._schedule.  This
             # rule already gets re-run whenever self.now changes.
             while self.now >= self._schedule[0]:
                 key = heapq.heappop(self._schedule)
                 if key in self._events:
                     self._events.pop(key).value = True
             return self._schedule[0]

         def after(self, when):
             if when not in self._events:
                 if when <= self.now:
                     return True
                 heapq.heappush(self._schedule, when)
                 # 'e' holds a strong reference to the new spoke, since
                 # _events only holds its values weakly.
                 self._events[when] = e = \
                     self.__cells__['next_event'].spoke(False)
             return self._events[when].value

The change between this and the older version is small: the after() 
method creates Spoke cells linked to the 'next_event' rule, so that 
when next_event fires, it fires any events whose time has 
arrived...  *in the same pulse*.  Ordinarily, setting a cell's value 
doesn't take effect until the *next* pulse, but with a spoke cell, it 
can take effect immediately, as long as the setting rule doesn't try 
to *read* the cell it's setting.

In effect, a Spoke cell is a one-way conduit from inside a rule to 
outside of the rule.  You can't put data in from the outside, nor 
read data in on the inside.
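
The "force the owning rule first" behavior can be sketched in plain 
Python roughly as follows.  This is only an assumption about how such a 
cell might work: Rule, Spoke, ensure_run() and the explicit pulse 
argument are hypothetical names, not the Trellis API.

```python
class Rule:
    """Toy rule that runs at most once per pulse."""

    def __init__(self, func):
        self.func = func
        self.last_pulse = None

    def ensure_run(self, pulse):
        if self.last_pulse != pulse:
            self.last_pulse = pulse
            self.func()


class Spoke:
    """Toy cell that only its owning rule is supposed to set."""

    def __init__(self, owner, initial):
        self._owner = owner
        self._value = initial

    def set(self, value):
        # By convention, called only from inside the owning rule.
        self._value = value

    def read(self, pulse):
        # Let the single writer run first, so the value is current.
        self._owner.ensure_run(pulse)
        return self._value


runs = []

def fire_alarm():
    runs.append("ran")
    alarm.set(True)

rule = Rule(fire_alarm)
alarm = Spoke(rule, False)
first = alarm.read(pulse=1)    # forces fire_alarm() to run
second = alarm.read(pulse=1)   # same pulse: the rule is not re-run
```

Because there is exactly one possible writer, the reader only has to 
ask one rule whether it has run yet.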

But let's back up a second.  You might be wondering, why can't I just 
make cells that say "Time.now >= sometime"?  And the reason for that 
is that your rule will be rechecked every time the time 
changes.  Which is to say, it will be polled constantly!  The Time 
service therefore wants to imperatively trigger events no sooner than 
the appropriate moment.  When a rule calls Time.after(), it depends 
on a cell (the Spoke) that does not change every time the clock advances.

So, any place where we might have a lot of rules depending on slight 
variations in a condition on a frequently-changing value, we can use 
Spokes and a control rule to make sure only the smallest relevant 
subset of values is changed.
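
To make the "control rule" idea concrete, here is a small standalone 
sketch using only heapq.  The fire_due() helper and the integer 
timestamps are invented for the example; the point is just that each 
clock tick touches only the deadlines that have actually arrived.

```python
import heapq

def fire_due(schedule, now):
    """Pop and return every deadline in the heap that is <= now."""
    fired = []
    while schedule and schedule[0] <= now:
        fired.append(heapq.heappop(schedule))
    return fired

schedule = [30, 10, 20]
heapq.heapify(schedule)

# One control rule runs per tick; waiting rules are woken once each.
woken = {now: fire_due(schedule, now) for now in (5, 10, 15, 25, 40)}
```

With polling, all three waiting rules would be rechecked on all five 
ticks; here each one is touched exactly once.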

Now let's consider a bidirectional reference setup between two 
attributes; let's say we have something like Person.likes and 
Person.is_liked_by set attributes.  The sets on each side could be 
tied to either side of an "Association" or "Relationship" 
object.  The add and remove methods of these sets would then write 
change operations into a Hub log attribute on the Relationship 
object, representing links created or broken.  And the sets would 
have "added" and "removed" Spoke attributes linked to an update rule 
on the relationship.

That update rule reads the log Hub and makes or breaks links.  And as 
it does so, it updates the "added" and "removed" spokes of the target 
objects' sets.  When the rule is finished, all the sets involved can 
now update themselves (using the "added" and "removed" spokes) before 
their contents can be seen by any rules.
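
A toy version of such a relationship object might look like the sketch 
below.  Everything here is an assumption for illustration: a plain list 
stands in for the Hub log, plain sets stand in for the "added" and 
"removed" spokes, and update() is called by hand rather than by the 
Trellis.

```python
class Relationship:
    """Toy many-to-many association with a pending-change log."""

    def __init__(self):
        self.links = set()     # (source, target) pairs
        self._log = []         # stands in for the Hub
        self.added = set()     # stand-ins for the Spokes
        self.removed = set()

    def link(self, a, b):
        self._log.append(("add", (a, b)))

    def unlink(self, a, b):
        self._log.append(("remove", (a, b)))

    def update(self):
        """The update rule: replay the log, publishing net changes."""
        self.added, self.removed = set(), set()
        for op, pair in self._log:
            if op == "add" and pair not in self.links:
                self.links.add(pair)
                if pair in self.removed:
                    self.removed.discard(pair)
                else:
                    self.added.add(pair)
            elif op == "remove" and pair in self.links:
                self.links.discard(pair)
                if pair in self.added:
                    self.added.discard(pair)
                else:
                    self.removed.add(pair)
        self._log = []

    def likes(self, person):
        return {b for a, b in self.links if a == person}

    def is_liked_by(self, person):
        return {a for a, b in self.links if b == person}


r = Relationship()
r.link("ann", "bob")
r.link("cy", "bob")
before = r.likes("ann")        # nothing visible until update() runs
r.update()
first_added = set(r.added)
r.unlink("ann", "bob")
r.update()
```

Both directions are answered from the same set of pairs, so they can 
never disagree with each other.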

Now we can implement subsets by defining a set's "added" items as a 
filtered subset of the parent set's "added" items, and the "removed" 
items as the same as the parent set's "removed" items.

Well, almost.  If the filter depends on values that can also change, 
then we still need a way to be notified when the filtered values 
change in a way that would cause them to be removed.  So the filter 
needs to actually create a cell for each item, that determines 
whether it's possibly a member, and if not, arranges for its future deletion.

Unfortunately, there is no way to make such deletions happen in the 
same pulse, because there's no way to know which of possibly 
thousands of maybe-updated cells could require a deletion.  Thus, you 
can't use a Spoke to make the deletion immediate.

Of course, it may not be that unfortunate, really.  It's similar to 
the situation where you update data in an MS Access datasheet, so 
that the updated data no longer matches the datasheet's 
filter.  Immediately updating the display would actually be 
problematic from the UI standpoint: the user might accidentally 
modify something they didn't mean to, and then have it confusingly 
disappear from view.  Requiring an explicit refresh here is a *good* 
thing, allowing the user to say, "okay, I'm done changing these things now".

In other words, distinguishing between the idea of a tangible, 
mutable set, and a transient snapshot *query*, may be a good idea.

In practice, this mostly comes into question in the case where you have 
some generic interaction between things and the sets they are members 
of.  For example, Chandler has the notion of items belonging to 
"collections", and collections may be either extensional or 
intensional sets.  That is, either they are collections you 
explicitly add things to, or else membership is determined by rules.

An item in Chandler wants to know all the collections it's a member 
of, so it can e.g. display them.  A query-based approach would work 
for the extensional (explicitly added) sets, but not for the 
rule-based sets.  You'd have to define its collections as the sum of 
its extensional collections, plus the subset of all intensional 
collections associated with those extensional collections that allow 
the item as a member.

This would implicitly link the conditions to the collection 
membership, such that membership would be recalculated whenever 
attributes involved in the filters change.  Likewise, changing the 
filter rules or changing what intensional collections are attached to 
a particular extensional collection, would force recalculation of 
memberships for every item currently in memory, and update any 
relevant displays accordingly.

With a bit more sophistication, this approach could be extended to do 
materialized views (i.e., actually store extensional membership data) 
and do immediate updates to queries.

This approach still doesn't work in "zero time", of 
course.  Rule-based collections get updated in the next moment, 
because the adds and removes of individual items need to pass through 
a Hub on the target collection, and as with any non-Spoke cell, 
writing to it can only occur in the "next moment".  Thus two updates 
to the screen occur: the first changing the attributes affecting 
collection membership, followed by a second one to update the 
collection membership itself, along with any queries affected by the 
change in membership.

Also, in this approach, the collections themselves don't need to 
keep a set of all their contents in memory at any given point in 
time.  They can simply have 'added' and 'removed' cells that reflect 
their changes, and that UI views and query objects can depend 
on.  Thus, views and queries can update themselves without needing to 
re-iterate over the entire collection contents.
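
For instance, a summarized value such as a running total could be 
maintained from the diffs alone.  This is a hypothetical sketch: the 
RunningTotal class and its on_change() method are invented names, and 
the caller is assumed to pass in each pulse's 'added' and 'removed' 
sets.

```python
class RunningTotal:
    """Toy view that tracks a sum without holding the collection."""

    def __init__(self, initial=0):
        self.total = initial

    def on_change(self, added, removed):
        # Cost is proportional to the size of the diff, not the set.
        self.total += sum(added) - sum(removed)


view = RunningTotal()
view.on_change(added={10, 20, 30}, removed=set())
after_first = view.total
view.on_change(added={5}, removed={20})
```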

Beautiful!  I'm not thrilled with filtered collections running 
behind, though.  Anything that depends on those collections' 
added/removed events will also run behind real time.

It's hard to think of a real problem caused by that, though.  The 
nature of the Trellis is that sequentially triggered updates like 
this just automatically cascade until everything is "up to date" again.

The most obvious problem this could cause is if you manage to define 
a paradoxical collection, such that the only members allowed are 
non-members.  This would cause an oscillating effect, such that as an 
item is added, it is then removed, and vice versa, leading to an 
infinite loop.  I'm not sure *how* you could set that up, 
though.  You can't do it using just the membership filters; you have 
to have something that gets set *based* on the membership filters.

Still, it would be nice if there was a way to define a real-time Hub, 
such that the accumulation of changes from individually changed items 
could happen in the "same" moment from the perspective of the 
collections.  I just don't see a way to do that without an N-way 
dependency from the collections to all items in memory, or some sort 
of hack involving the Trellis pulse logs.

The key is that to make a real-time hub -- just like with a Spoke -- 
you have to know before you read its value, whether everything that 
*can* change the value, has already had a chance to.  With a Spoke, 
we can guarantee this easily because there's only one rule to 
ask.  For a Hub, I don't know of any way to do this without knowing 
-- and querying! -- all N sources of data for the hub.

So, it simply boils down to the idea that large collections whose 
membership is purely rule-based, cannot efficiently transmit update 
events in a single pulse -- which breaks the otherwise "perfectly 
simultaneous universe" of the Trellis.  This means that rules which 
rely on set-difference information for such collections, must not 
*also* look at individual item data, in order to avoid seeing an 
inconsistency in the fabric of the universe.  :)

Ugly.  But at least it works.  Personally, all of this seems to me to 
be an argument for layering an application's domain model over a 
fact-based or relational model underneath, such that all imperative 
model changes pass through "relationship" objects of the type 
described above.  Such objects can then mediate set 
membership events via spokes, such that everything changes in the 
*same* moment.

In this model, *all* domain model values are mapped to set operations 
at a lower level.  A "single-valued field" is really a link between 
an item's primary key and the field's value.  Changing the field value is 
removing an old value from the 1-element set, and then adding a new 
value.  The underlying storage service can receive these changes via 
a hub, and distribute them via spokes, so that changes always 
propagate in a single pulse.
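
Here is one way the "field assignment as set operations" mapping could 
look, as a plain-Python sketch.  The (key, field, value) tuple layout 
and the set_field() helper are assumptions chosen for the example, not 
part of any existing API.

```python
def set_field(facts, key, field, new_value):
    """Assign a field by diffing its one-element set of fact tuples.

    Returns the (removed, added) sets that would be sent to the hub.
    """
    old = {t for t in facts if t[0] == key and t[1] == field}
    new = {(key, field, new_value)}
    removed, added = old - new, new - old
    facts.difference_update(removed)
    facts.update(added)
    return removed, added


facts = set()
first = set_field(facts, 42, "name", "Ann")
second = set_field(facts, 42, "name", "Anne")
third = set_field(facts, 42, "name", "Anne")   # no-op: nothing changes
```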

Hm.  In fact, this is really the best model, because the filtering 
problem goes away when you view things in a relational model, i.e., 
as sets of tuples.  There is no way for a tuple to "change", so you 
can easily make a filtered subset of a tuple set, by filtering the 
parent set's "added" event, and using its "removed" event directly.
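
The filtering argument can be sketched in a few lines of plain Python. 
The filtered_diff() and apply_diff() helpers and the ("task", id, 
status) tuple layout are made up for illustration; the sketch also 
assumes a diff never adds and removes the same tuple.

```python
def filtered_diff(predicate, added, removed):
    """Map a parent set's diff onto a filtered child set's diff."""
    child_added = {t for t in added if predicate(t)}
    # Tuples never change, so a removed tuple either was in the child
    # (and must go) or never was (and removing it is a no-op).
    return child_added, set(removed)

def apply_diff(members, added, removed):
    members.difference_update(removed)
    members.update(added)


def is_open(task):
    return task[2] == "open"

open_tasks = set()

a, r = filtered_diff(is_open,
                     added={("task", 1, "open"), ("task", 2, "done")},
                     removed=set())
apply_diff(open_tasks, a, r)
after_first = set(open_tasks)

# "Changing" task 1 is really removing one tuple and adding another.
a, r = filtered_diff(is_open,
                     added={("task", 1, "done")},
                     removed={("task", 1, "open")})
apply_diff(open_tasks, a, r)
```

Note that the filter never has to watch individual items: the status 
change arrives as an ordinary remove plus add on the parent set.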

Application-level structures can then be implemented as rule-based 
transforms of these tuple sets, mapping from keys to cached 
objects.  In other words, you end up with an object-relational 
mapper, and instant updates of all queries, views, and collections.  Yay!

A general "relational" service would then consist of a hub to 
accumulate tuples being added or removed, combined with a cache of 
spokes reflecting specific "addresses" in tuple space -- i.e., 
partial tuple matches based on type (i.e., table) and keys (primary 
and foreign).  A single update rule processes the added or removed 
tuples and sends update notifications to various spokes.
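
A very reduced sketch of such a service follows.  It is an assumption 
throughout: a list stands in for the hub, callbacks stand in for 
spokes, and an "address" is simplified to just (table, primary key), 
taken from the first two fields of each tuple.

```python
from collections import defaultdict

class RelationService:
    """Toy tuple store: queued changes in, per-address notices out."""

    def __init__(self):
        self._pending = []                   # stands in for the Hub
        self._tuples = set()
        self._watchers = defaultdict(list)   # address -> "spokes"

    def watch(self, table, key, callback):
        self._watchers[(table, key)].append(callback)

    def add(self, tup):
        self._pending.append(("add", tup))

    def remove(self, tup):
        self._pending.append(("remove", tup))

    def update(self):
        """The single update rule: apply changes, notify watchers."""
        pending, self._pending = self._pending, []
        for op, tup in pending:
            if op == "add" and tup not in self._tuples:
                self._tuples.add(tup)
            elif op == "remove" and tup in self._tuples:
                self._tuples.discard(tup)
            else:
                continue                     # no real change
            for callback in self._watchers.get(tup[:2], ()):
                callback(op, tup)


events = []
service = RelationService()
service.watch("person", 1, lambda op, tup: events.append((op, tup)))
service.add(("person", 1, "Ann"))
service.add(("person", 2, "Bob"))
service.update()
```

Only the watcher for person 1 is notified; nothing is listening at 
person 2's address, so that tuple costs no notification at all.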

Query operations against the relation service simply return spokes, 
or cells computed using the spokes.  For example, when you create a 
domain object for a given primary key, you could create its cells so 
that they link to spokes for each of the tables containing that 
object's data, for the specific primary or foreign key(s).

Meanwhile, you could have a backing store that listens to the 
relation service's add/remove events, and accumulates them in a 
buffer for writing to the "real" database.  Any attempt to perform a 
query on the backing store can force this buffer to be written out 
before the query is actually executed.
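
The flush-before-query behavior might be sketched like this.  The 
BufferedStore class is hypothetical, and an in-memory set stands in 
for the "real" database.

```python
class BufferedStore:
    """Toy backing store that batches writes until someone queries."""

    def __init__(self):
        self._rows = set()     # stands in for the real database
        self._buffer = []
        self.flushes = 0

    def record(self, op, tup):
        # Called for each add/remove event from the relation service.
        self._buffer.append((op, tup))

    def flush(self):
        if not self._buffer:
            return
        for op, tup in self._buffer:
            if op == "add":
                self._rows.add(tup)
            else:
                self._rows.discard(tup)
        self._buffer = []
        self.flushes += 1

    def query(self, predicate):
        self.flush()           # never query stale data
        return {t for t in self._rows if predicate(t)}


store = BufferedStore()
store.record("add", ("person", 1, "Ann"))
store.record("add", ("person", 2, "Bob"))
store.record("remove", ("person", 2, "Bob"))
people = store.query(lambda t: t[0] == "person")
again = store.query(lambda t: t[0] == "person")   # nothing to flush
```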

The relational service would query the backing store to get the 
actual contents of sets as they're needed, and using schema 
information (that it'll need anyway in order to know about primary 
and foreign keys), it can decide whether to cache the whole thing or 
just a subset.  You could also just make a version of the relational 
service that has no backing store, but just keeps everything in 
memory.  Querying such a service just gives you back a real set, 
that's automatically updated via a spoke whenever somebody changes something.

Wow.  That's pretty neat.  ORM plus MVC equals application.  :)  You 
can even implement transaction rollback and undo by keeping another 
tuple-diff log, similar to the one kept to buffer DB writes, and 
reversing the actions.
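
Rollback from a tuple-diff log could be as simple as the sketch below. 
The helper names are invented, and the sketch assumes the log only 
ever records changes that actually took effect.

```python
def apply_change(facts, op, tup):
    if op == "add":
        facts.add(tup)
    else:
        facts.discard(tup)

def record(facts, log, op, tup):
    """Apply a change and remember it for a possible rollback."""
    apply_change(facts, op, tup)
    log.append((op, tup))

def rollback(facts, log):
    """Undo logged changes in reverse order, emptying the log."""
    while log:
        op, tup = log.pop()
        apply_change(facts, "remove" if op == "add" else "add", tup)


facts = {("person", 1, "Ann")}
log = []
record(facts, log, "remove", ("person", 1, "Ann"))
record(facts, log, "add", ("person", 1, "Anne"))
changed = set(facts)
rollback(facts, log)
```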

So, that is definitely the right way to go.  There are some details 
to be figured out here, though, in that you still need a special kind 
of cell for the domain model attributes.  Writing to a domain model 
field, for example, should result in changes being immediately 
written to the tuple-based layer beneath, rather than being chained via a cell.

So, there may need to be a third type of cell -- a Rim?  Axle? -- 
that works by sending data to a Hub somewhere.  Technically, however, 
this can be implemented using a hub, since hubs will have code that 
gets run whenever a value is set.  So, that code can just interpret 
the set operation appropriately.

This leaves one final problem, which is implementing domain 
attributes with mutually dependent rules.  It's not obvious that 
these can still work in this model, without pushing those rules into 
the fact tuple layer, instead of leaving them in the domain 
layer.  However, the rules can still be defined in the domain layer; 
it's just that objects retrieved from the relational layer won't have 
those rules in effect.

A simple "clone" operation, however, could be used to make a 
scratchpad copy of the object for editing, with all the 
interdependency rules in effect -- it's just another instance of the 
domain object, with its default cell rules instead of the 
database-backed ones.  "Saving" the changed object would then just 
mean copying the values back to the original object, automatically 
updating all dependencies.

In other words, you could use the same trellis.Component class for 
both database-backed and non-database backed instances, with the only 
difference being their runtime Cell contents.

I can even envision domain model classes having an "errors" cell that 
lists validation problems with the content of the object.  The 
interaction layer that's editing the object can display the errors in 
real time during editing, and refuse to allow saving until there are 
no errors left!
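
As a plain-Python approximation of the "errors" idea (a property 
stands in for a rule-driven cell, and PersonDraft, save_to() and the 
specific checks are all made up for the example):

```python
class PersonDraft:
    """Toy scratchpad object with a derived list of problems."""

    def __init__(self, name, age):
        self.name = name
        self.age = age

    @property
    def errors(self):
        # Recomputed on every read; a real cell would cache this.
        problems = []
        if not self.name.strip():
            problems.append("name is required")
        if self.age < 0:
            problems.append("age must not be negative")
        return problems

    def save_to(self, target):
        if self.errors:
            raise ValueError("; ".join(self.errors))
        target.update(name=self.name, age=self.age)


record = {"name": "Ann", "age": 30}
draft = PersonDraft("", -1)
initial_errors = draft.errors
draft.name, draft.age = "Anne", 31
draft.save_to(record)
```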

So, now all I need to do is implement Hubs, Spokes, and an O-R 
mapping layer.  ;-)  (Unfortunately, reuse of an existing O-R mapper 
isn't practical since the model heavily depends on the way writes to 
the "R" level are propagated back to the "O" level.)