[PEAK] A hub-and-spoke mechanism for efficient many-to-many cells
Phillip J. Eby
pje at telecommunity.com
Tue Jul 17 17:39:48 EDT 2007
After doing a lot of thinking about managing large mutable data
structures in the Trellis (especially mutually-dependent ones like
bidirectional relationships in an application database), I think I've
come up with a way to efficiently manage them.
Normally, the Trellis prefers values to be derived by rules
(functionally) rather than determined by setting values or modifying
data structures (imperatively). Imperative operations have an
efficiency penalty because they have to be deferred to the next
"pulse" of the Trellis, so that any rules that depend on *current*
values still have a chance to see them before they change.
That's the first efficiency problem with large sets; the other is
that if one set is computed as a subset of another large set, then
any change to the large set results in an O(N) operation to compute
the smaller set. So, you really need a way to obtain "diffs". You
may also have *summarized* values, such as a sum of data from the
changed set, or a rule that applies to only one particular member of
the set. In such cases, you'd like to only be called when something
you care about changes, not when the entire set changes.
So, I've come up with a pair of basic mechanisms I call a Hub and
Spoke, that can be used to address these issues.
A Hub is a cell that can have multiple values written to it, given a
function that computes how its "future value" should change as each
value is written to it. When you read a Hub, it still returns its
*current* value, but as soon as the Trellis ticks over to a new
pulse, it will start returning its accumulated "future" value.
In this way, you can create mutable data structures by having mutator
methods put callables into a Hub attribute that accumulates these
actions into a "roll-forward" log. Your main data cell is then
computed as a rule that invokes these actions before returning the data.
So Hub cells can work as accumulators for N imperative operations
applied to a single mutable data structure - an N:1 relationship.
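
To make the intended semantics concrete, here's a rough plain-Python
sketch of a Hub along the lines described above. Since Hubs don't
exist yet, the class and method names (Hub, write, advance) are just
illustrative stand-ins, not actual Trellis API:

class Hub(object):
    """Accumulates writes into a 'future' value; reads see the current one."""

    def __init__(self, initial, combine):
        self.current = initial      # what readers see during this pulse
        self.future = initial       # what they'll see next pulse
        self.combine = combine      # how each write folds into the future value

    def write(self, value):
        # Writes only affect the future value, so rules that read
        # self.current in the same pulse still see the old data.
        self.future = self.combine(self.future, value)

    def advance(self):
        # Called when the Trellis ticks over to a new pulse.
        self.current = self.future

# Example: a roll-forward log of mutation callables applied to a list.
log = Hub([], lambda future, op: future + [op])
data = []
log.write(lambda d: d.append(1))
log.write(lambda d: d.append(2))
assert log.current == []            # readers this pulse still see the old log
log.advance()
for op in log.current:              # a data rule would replay the log like this
    op(data)
assert data == [1, 2]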
On the flip side, 1:N relationships can be handled by Spokes. A
Spoke is a cell whose value can only be set by one rule, which must
be predeclared. Thus, you can set its value imperatively *without
requiring a new pulse*. Specifically, when a Spoke is asked for its
value, it can force the rule it depends on to be calculated
first. Thus, it doesn't need to wait for another moment of time to
be sure that it's up-to-date.
Here's a rewrite of my hypothetical "Time" service using Spokes:
class Time(trellis.Component, context.Service):

    trellis.rules(
        now = lambda self:
            volatile() and Timestamp(time.time()),
        _schedule = lambda self: [NOT_YET],
        _events = lambda self: weakref.WeakValueDictionary(),
    )

    @trellis.rule
    def next_event(self):
        # Note that because 'now' is volatile() and always
        # has a new value, we don't need to worry about
        # tracking changes to self._schedule.  This
        # rule already gets re-run whenever self.now changes.
        while self.now >= self._schedule[0]:
            key = heapq.heappop(self._schedule)
            if key in self._events:
                self._events.pop(key).value = True
        return self._schedule[0]

    def after(self, when):
        if when not in self._events:
            if when <= self.now:
                return True
            heapq.heappush(self._schedule, when)
            self._events[when] = e = \
                self.__cells__['next_event'].spoke(False)
        return self._events[when].value
The change between this and the older version is small: the after()
method creates Spoke cells linked to the 'next_event' rule, so that
when next_event fires, it fires any events whose time has
arrived... *in the same pulse*. Ordinarily, setting a cell's value
doesn't take effect until the *next* pulse, but with a spoke cell, it
can take effect immediately, as long as the setting rule doesn't try
to *read* the cell it's setting.
In effect, a Spoke cell is a one-way conduit from inside a rule to
outside of the rule. You can't put data in from the outside, nor
read the data back from inside the rule.
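
Again purely to pin down the intended semantics (not actual Trellis
API), a Spoke behaves roughly like this plain-Python sketch, where
reading forces the one predeclared rule to run first:

class Spoke(object):
    """A cell writable only by one predeclared rule, readable in the same pulse."""

    def __init__(self, initial, source_rule):
        self.value = initial
        self._source = source_rule      # the only rule allowed to set us
        self._fresh = False

    def set(self, value):
        # Called from inside the source rule: takes effect immediately.
        self.value = value

    def get(self):
        # Reading forces the one rule that can change us to run first,
        # so no extra pulse is needed to be sure we're up to date.
        if not self._fresh:
            self._source()
            self._fresh = True
        return self.value

# Example: a control rule that fires a spoke as a side effect.
ran = []
def control_rule():
    ran.append(True)
    ready.set(True)

ready = Spoke(False, control_rule)
assert ready.get() is True and ran == [True]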
But let's back up a second. You might be wondering, why can't I just
make cells that say "Time.now >= sometime"? And the reason for that
is that your rule will be rechecked every time the time
changes. Which is to say, it will be polled constantly! The Time
service therefore wants to imperatively trigger events no sooner than
the appropriate moment. When a rule calls Time.after(), it depends
on a cell (the Spoke) that does not change every time the clock advances.
So, any place where we might have a lot of rules depending on slight
variations in a condition on a frequently-changing value, we can use
Spokes and a control rule to make sure only the smallest relevant
subset of values is changed.
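
Stripped of the Trellis machinery, the dispatch pattern behind
next_event boils down to something like this sketch: a heap of
pending deadlines and a single control loop that fires only the ones
that have actually arrived (the function names here are illustrative):

import heapq

schedule = []                    # heap of (when, callback) pairs

def after(when, callback):
    heapq.heappush(schedule, (when, callback))

def tick(now):
    # Fire only the callbacks whose time has come; everything else is
    # left alone, so its dependents aren't "polled" on every tick.
    while schedule and schedule[0][0] <= now:
        when, callback = heapq.heappop(schedule)
        callback(when)

fired = []
after(5, fired.append)
after(10, fired.append)
tick(3)      # nothing due yet
tick(7)      # only the deadline at 5 fires
assert fired == [5]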
Now let's consider a bidirectional reference setup between two
attributes; let's say we have something like Person.likes and
Person.is_liked_by set attributes. The sets on each side could be
tied to either side of an "Association" or "Relationship"
object. The add and remove methods of these sets would then write
change operations into a Hub log attribute on the Relationship
object, representing links created or broken. And the sets would
have "added" and "removed" Spoke attributes linked to an update rule
on the relationship.
That update rule reads the log Hub and makes or breaks links. And as
it does so, it updates the "added" and "removed" spokes of the target
objects' sets. When the rule is finished, all the sets involved can
now update themselves (using the "added" and "removed" spokes) before
their contents can be seen by any rules.
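
Here's a rough plain-Python sketch of that data flow, with the Hub
reduced to a plain log list and the "added"/"removed" spokes reduced
to plain lists keyed by (person, attribute), just to show the shape
of the update rule (the names are illustrative):

from collections import defaultdict

class Relationship(object):
    def __init__(self):
        self.log = []                       # stands in for the Hub log
        self.links = set()                  # current (liker, liked) pairs
        self.added = defaultdict(list)      # stands in for per-set "added" spokes
        self.removed = defaultdict(list)    # ...and per-set "removed" spokes

    def link(self, a, b):                   # called by a set's add() method
        self.log.append(('add', a, b))

    def unlink(self, a, b):                 # called by a set's remove() method
        self.log.append(('remove', a, b))

    def update(self):
        # The single update rule: replay the log, then tell each side's
        # set exactly what changed, instead of handing it the whole set.
        for op, a, b in self.log:
            if op == 'add' and (a, b) not in self.links:
                self.links.add((a, b))
                self.added[a, 'likes'].append(b)
                self.added[b, 'is_liked_by'].append(a)
            elif op == 'remove' and (a, b) in self.links:
                self.links.remove((a, b))
                self.removed[a, 'likes'].append(b)
                self.removed[b, 'is_liked_by'].append(a)
        del self.log[:]

likes = Relationship()
likes.link('alice', 'bob')
likes.update()
assert likes.added['bob', 'is_liked_by'] == ['alice']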
Now we can implement subsets by defining a set's "added" items as a
filtered subset of the parent set's "added" items, and the "removed"
items as the same as the parent set's "removed" items.
Well, almost. If the filter depends on values that can also change,
then we still need a way to be notified when the filtered values
change in a way that would cause them to be removed. So the filter
needs to actually create a cell for each item that determines
whether it's possibly a member, and if not, arranges for its future deletion.
Unfortunately, there is no way to make such deletions happen in the
same pulse, because there's no way to know which of possibly
thousands of maybe-updated cells could require a deletion. Thus, you
can't use a Spoke to make the deletion immediate.
Of course, it may not be that unfortunate, really. It's similar to
the situation where you update data in an MS Access datasheet, so
that the updated data no longer matches the datasheet's
filter. Immediately updating the display would actually be
problematic from the UI standpoint: the user might accidentally
modify something they didn't mean to, and then have it confusingly
disappear from view. Requiring an explicit refresh here is a *good*
thing, allowing the user to say, "okay, I'm done changing these things now".
In other words, distinguishing between the idea of a tangible,
mutable set, and a transient snapshot *query*, may be a good idea.
In practice, this most comes into question in the case where you have
some generic interaction between things and the sets they are members
of. For example, Chandler has the notion of items belonging to
"collections", and collections may be either intentional or
extensional sets. That is, either they are collections you
explicitly add things to, or else membership is determined by rules.
An item in Chandler wants to know all the collections it's a member
of, so it can e.g. display them. A query-based approach would work
for the extensional (explicitly added) sets, but not for the
rule-based sets. You'd have to define its collections as the sum of
its extensional collections, plus the subset of all intensional
collections associated with those extensional collections that allow
the item as a member.
This would implicitly link the conditions to the collection
membership, such that membership would be recalculated whenever
attributes involved in the filters change. Likewise, changing the
filter rules or changing what intensional collections are attached to
a particular extensional collection, would force recalculation of
memberships for every item currently in memory, and update any
relevant displays accordingly.
With a bit more sophistication, this approach could be extended to do
materialized views (i.e., actually store extensional membership data)
and do immediate updates to queries.
This approach still doesn't work in "zero time", of
course. Rule-based collections get updated in the next moment,
because the adds and removes of individual items need to pass through
a Hub on the target collection, and as with any non-Spoke cell,
writing to it can only occur in the "next moment". Thus two updates
to the screen occur: the first changing the attributes affecting
collection membership, followed by a second one to update the
collection membership itself, along with any queries affected by the
change in membership.
Also, in this approach, the collections themselves don't need to
keep a set of all their contents in memory at any given point in
time. They can simply have 'added' and 'removed' cells that reflect
their changes, and that UI views and query objects can depend
on. Thus, views and queries can update themselves without needing to
re-iterate over the entire collection contents.
Beautiful! I'm not thrilled with filtered collections running
behind, though. Anything that depends on those collections'
added/removed events will also run behind real time.
It's hard to think of a real problem caused by that, though. The
nature of the Trellis is that sequentially triggered updates like
this just automatically cascade until everything is "up to date" again.
The most obvious problem this could cause is if you manage to define
a paradoxical collection, such that the only members allowed are
non-members. This would cause an oscillating effect, such that as an
item is added, it is then removed, and vice versa, leading to an
infinite loop. I'm not sure *how* you could set that up,
though. You can't do it using just the membership filters; you have
to have something that gets set *based* on the membership filters.
Still, it would be nice if there was a way to define a real-time Hub,
such that the accumulation of changes from individually changed items
could happen in the "same" moment from the perspective of the
collections. I just don't see a way to do that without an N-way
dependency from the collections to all items in memory, or some sort
of hack involving the Trellis pulse logs.
The key is that to make a real-time hub -- just like with a Spoke --
you have to know before you read its value, whether everything that
*can* change the value, has already had a chance to. With a Spoke,
we can guarantee this easily because there's only one rule to
ask. For a Hub, I don't know of any way to do this without knowing
-- and querying! -- all N sources of data for the hub.
So, it simply boils down to the idea that large collections whose
membership is purely rule-based, cannot efficiently transmit update
events in a single pulse -- which breaks the otherwise "perfectly
simultaneous universe" of the Trellis. This means that rules which
rely on set-difference information for such collections, must not
*also* look at individual item data, in order to avoid seeing an
inconsistency in the fabric of the universe. :)
Ugly. But at least it works. All of this seems to me to
be an argument for layering an application's domain model over a
fact-based or relational model underneath, such that all imperative
model changes pass through "relationship" objects of the type
described above. Such objects can then mediate set
membership events via spokes, such that everything changes in the
*same* moment.
In this model, *all* domain model values are mapped to set operations
at a lower level. A "single-valued field" is really a link between
an item's primary key, and the field. Changing the field value is
removing an old value from the 1-element set, and then adding a new
value. The underlying storage service can receive these changes via
a hub, and distribute them via spokes, so that changes always
propagate in a single pulse.
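
As a rough sketch of that mapping, where the (table, key, field,
value) tuple layout and the function name are just illustrative
assumptions:

def set_field(facts, changes, table, key, field, value):
    """Replace the 1-element set for (table, key, field) with {value}."""
    for old in [t for t in facts if t[:3] == (table, key, field)]:
        facts.discard(old)
        changes.append(('remove', old))     # the old value leaves the set...
    new = (table, key, field, value)
    facts.add(new)
    changes.append(('add', new))            # ...and the new value joins it

facts, changes = set(), []
set_field(facts, changes, 'person', 1, 'name', 'Alice')
set_field(facts, changes, 'person', 1, 'name', 'Alicia')
assert ('person', 1, 'name', 'Alicia') in facts
assert ('remove', ('person', 1, 'name', 'Alice')) in changes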
Hm. In fact, this is really the best model, because the filtering
problem goes away when you view things in a relational model, i.e.,
as sets of tuples. There is no way for a tuple to "change", so you
can easily make a filtered subset of a tuple set, by filtering the
parent set's "added" event, and using its "removed" event directly.
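
In sketch form, deriving a filtered view's diff from the parent's
diff is then just:

def filtered_diff(predicate, parent_added, parent_removed):
    added = [t for t in parent_added if predicate(t)]
    removed = list(parent_removed)   # whatever left the parent is gone here too
    return added, removed

is_done = lambda t: t[2] == 'done'
added, removed = filtered_diff(
    is_done,
    parent_added=[('task', 1, 'done'), ('task', 2, 'open')],
    parent_removed=[('task', 3, 'done')],
)
assert added == [('task', 1, 'done')]
assert removed == [('task', 3, 'done')]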
Application-level structures can then be implemented as rule-based
transforms of these tuple sets, mapping from keys to cached
objects. In other words, you end up with an object-relational
mapper, and instant updates of all queries, views, and collections. Yay!
A general "relational" service would then consist of a hub to
accumulate tuples being added or removed, combined with a cache of
spokes reflecting specific "addresses" in tuple space -- i.e.,
partial tuple matches based on type (i.e., table) and keys (primary
and foreign). A single update rule processes the added or removed
tuples and sends update notifications to various spokes.
Query operations against the relation service simply return spokes,
or cells computed using the spokes. For example, when you create a
domain object for a given primary key, you could create its cells so
that they link to spokes for each of the tables containing that
object's data, for the specific primary or foreign key(s).
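
A rough sketch of that dispatch step, with the spokes reduced to
plain callbacks and the "addresses" reduced to (table, primary key)
pairs, might look like this (all names are illustrative):

from collections import defaultdict

class RelationalService(object):
    def __init__(self):
        self.pending = []                      # stands in for the Hub
        self.listeners = defaultdict(list)     # address -> registered "spokes"

    def watch(self, table, key, callback):
        self.listeners[table, key].append(callback)

    def add(self, table, key, row):
        self.pending.append(('add', table, key, row))

    def remove(self, table, key, row):
        self.pending.append(('remove', table, key, row))

    def update(self):
        # The single update rule: route each change only to the spokes
        # registered for its address, not to every interested party.
        for op, table, key, row in self.pending:
            for callback in self.listeners[table, key]:
                callback(op, row)
        del self.pending[:]

svc = RelationalService()
seen = []
svc.watch('person', 1, lambda op, row: seen.append((op, row)))
svc.add('person', 1, {'name': 'Alice'})
svc.add('person', 2, {'name': 'Bob'})   # no one is watching this address
svc.update()
assert seen == [('add', {'name': 'Alice'})]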
Meanwhile, you could have a backing store that listens to the
relation service's add/remove events, and accumulates them in a
buffer for writing to the "real" database. Any attempt to perform a
query on the backing store can force this buffer to be written out
before the query is actually executed.
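
Sketched in plain Python, with an in-memory set standing in for the
"real" database, that buffering might look like:

class BackingStore(object):
    def __init__(self):
        self.buffer = []        # tuple diffs waiting to be written out
        self.rows = set()       # stands in for the real database

    def record(self, op, row):
        self.buffer.append((op, row))

    def flush(self):
        for op, row in self.buffer:
            if op == 'add':
                self.rows.add(row)
            else:
                self.rows.discard(row)
        del self.buffer[:]

    def query(self, predicate):
        self.flush()            # force pending writes out before querying
        return [row for row in self.rows if predicate(row)]

store = BackingStore()
store.record('add', ('person', 1, 'name', 'Alice'))
assert store.query(lambda row: row[0] == 'person') == \
    [('person', 1, 'name', 'Alice')]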
The relational service would query the backing store to get the
actual contents of sets as they're needed, and using schema
information (that it'll need anyway in order to know about primary
and foreign keys), it can decide whether to cache the whole thing or
just a subset. You could also just make a version of the relational
service that has no backing store, but just keeps everything in
memory. Querying such a service just gives you back a real set
that's automatically updated via a spoke whenever somebody changes something.
Wow. That's pretty neat. ORM plus MVC equals application. :) You
can even implement transaction rollback and undo by keeping another
tuple-diff log, similar to the one kept to buffer DB writes, and
reversing the actions.
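
In sketch form, undoing by reversing a tuple-diff log is only a few
lines (the function names are illustrative):

def apply_change(facts, op, row):
    if op == 'add':
        facts.add(row)
    else:
        facts.discard(row)

def rollback(facts, log):
    # Replay the log backwards, inverting each operation.
    for op, row in reversed(log):
        apply_change(facts, 'remove' if op == 'add' else 'add', row)

facts = {('person', 1, 'name', 'Alice')}
log = []
for change in [('remove', ('person', 1, 'name', 'Alice')),
               ('add', ('person', 1, 'name', 'Alicia'))]:
    log.append(change)
    apply_change(facts, *change)
rollback(facts, log)
assert facts == {('person', 1, 'name', 'Alice')}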
So, that is definitely the right way to go. There are some details
to be figured out here, though, in that you still need a special kind
of cell for the domain model attributes. Writing to a domain model
field, for example, should result in changes being immediately
written to the tuple-based layer beneath, rather than being chained via a cell.
So, there may need to be a third type of cell -- a Rim? Axle? --
that works by sending data to a Hub somewhere. Technically, however,
this can be implemented using a hub, since hubs will have code that
gets run whenever a value is set. So, that code can just interpret
the set operation appropriately.
This leaves one final problem, which is implementing domain
attributes with mutually dependent rules. It's not obvious that
these can still work in this model, without pushing those rules into
the fact tuple layer, instead of leaving them in the domain
layer. However, the rules can still be defined in the domain layer;
it's just that objects retrieved from the relational layer won't have
those rules in effect.
A simple "clone" operation, however, could be used to make a
scratchpad copy of the object for editing, with all the
interdependency rules in effect -- it's just another instance of the
domain object, with its default cell rules instead of the
database-backed ones. "Saving" the changed object would then just
mean copying the values back to the original object, automatically
updating all dependencies.
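
Leaving the Cell machinery aside, the clone-and-save flow is roughly
this (the Draft name and explicit field list are just illustrative):

import copy

class Draft(object):
    """A scratchpad copy of an object, saved back as a whole when done."""

    def __init__(self, original, fields):
        self.original, self.fields = original, fields
        for name in fields:     # copy values out so edits stay local
            setattr(self, name, copy.copy(getattr(original, name)))

    def save(self):
        for name in self.fields:
            # Copy the edited values back; in the Trellis version this is
            # what would fire the original object's dependencies.
            setattr(self.original, name, getattr(self, name))

class Person(object):
    pass

p = Person()
p.name = 'Alice'
draft = Draft(p, ['name'])
draft.name = 'Alicia'
draft.save()
assert p.name == 'Alicia'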
In other words, you could use the same trellis.Component class for
both database-backed and non-database backed instances, with the only
difference being their runtime Cell contents.
I can even envision domain model classes having an "errors" cell that
lists validation problems with the content of the object. The
interaction layer that's editing the object can display the errors in
real time during editing, and refuse to allow saving until there are
no errors left!
So, now all I need to do is implement Hubs, Spokes, and an O-R
mapping layer. ;-) (Unfortunately, reuse of an existing O-R mapper
isn't practical since the model heavily depends on the way writes to
the "R" level are propagated back to the "O" level.)