[PEAK] More hub-and-spoke-database-ry :)

Wed Jul 25 13:18:45 EDT 2007

I've been thinking a bit more about how a factbase system would work 
with the trellis, and I think I've got it narrowed down to something 
interesting.

So, your fact base itself is going to look something like this:

     class FactBase(trellis.Component):

         cache = trellis.rule(lambda self: WeakValueDictionary())

         def __getitem__(self, key):
             try:
                 return self.cache[key]
             except KeyError:
                 data = self.cache[key] = self.makeSet(key)
                 return data

Where 'self.makeSet()' is either hardcoded or an open-ended generic function.

So far, this looks a lot like Calendar or Month from my previous 
example.  But let's go all the way and say you can add or delete 
arbitrary objects from this sucker:

         def add(self, item):
             self.log = ('add', item)

         def remove(self, item):
             self.log = ('del', item)

I'll assume here that 'log' is a Hub attribute that keeps track of 
actions, such that 'self.added' and 'self.removed' are event cells 
containing sets of records that have been added or removed as of any 
given point in time.

The code to do that isn't trivial to sketch right now, since Hubs 
don't exist yet.  However, once that code exists, it'll probably be 
part of some sort of "trellis.Set" base class that FactBase can 
simply inherit from.  That is, FactBase will be a mutable set that 
also acts like a dictionary of sets.

Okay, so suppose we now want to know about all the records of a 
particular type that have been added to our fact base, such that we 
get updates on added or deleted records of that type.

Let's say that we want to just say "myFactBase[SomeRecordType]" to 
get back a set of all records of that type.  So we implement 
'makeSet(key)' for keys that are record types, such that it returns a 
set.  To do this, we need an object I'll call a "router":

     class RecordTypeRouter(trellis.Component):
         trellis.values(
             factbase = None,
         )

         cache = trellis.rule(lambda self: WeakValueDictionary())

         @trellis.rule
         def update(self):
             cache = self.cache
             added = {}
             deleted = {}
             for record in self.factbase.deleted:
                 deleted.setdefault(type(record), set()).add(record)
             for record in self.factbase.added :
                 added.setdefault(type(record), set()).add(record)
             for k, v in deleted.iteritems():
                 if k in cache:
                     cache[k].deleted = v
             for k, v in added.iteritems():
                 if k in cache:
                     cache[k].added = v

         def __getitem__(self, key):
             try:
                 return self.cache[key]
             except KeyError:
                 update = self.__cells__['update']
                 data = self.cache[key] = VirtualSet(
                    added   = update.spoke((), event=True),
                    deleted = update.spoke((), event=True),
                 )
                 return data

So what this does is that it distributes change events, keyed on 
record type.  Adding records of a type nobody cares about, doesn't 
update any sets.  Only sets you're "looking at" (and thus are cached) 
get add/delete events.  You could probably generalize this a bit 
more, and make a FilterRouter that uses a function and a FilterSet 
that uses the function plus a value, e.g.:

     class FilteringRouter(trellis.Component):
         trellis.values(
             parent = None,
             keyfunc = type,
         )
         trellis.cell_factories(
             added = lambda self: parent.__cells__['added'],
             deleted = lambda self: parent.__cells__['deleted'],
         )
         cache = trellis.rule(lambda self: WeakValueDictionary())

         @trellis.rule
         def update(self):
             cache = self.cache
             key_of = self.keyfunc
             added = {}
             deleted = {}
             for record in self.deleted:
                 deleted.setdefault(key_of(record), set()).add(record)
             for record in self.added :
                 added.setdefault(key_of(record), set()).add(record)
             for k, v in deleted.iteritems():
                 if k in cache:
                     cache[k].deleted = v
             for k, v in added.iteritems():
                 if k in cache:
                     cache[k].added = v

         def __getitem__(self, key):
             try:
                 return self.cache[key]
             except KeyError:
                 update = self.__cells__['update']
                 data = self.cache[key] = FilteredSet(
                    added   = update.spoke((), event=True),
                    deleted = update.spoke((), event=True),
                    keyfunc = self.keyfunc, key = key,
                    parent = self.parent
                 )
                 return data

Now, FilteredSet can in principle implement __iter__ and __contains__ 
in terms of its parent set (plus keyfunc(record)==key tests), and you 
can implement arbitrary hierarchies of "addressable sets" in a 
factbase, e.g. factbase[RecordType, keyfield, keyval] to retrieve a 
set of records by the contents of a particular field.

I don't really like __getitem__ having to know what type of set you 
want, though.  Plus, this mechanism of using __getitem__ and caches 
in the first place is ugly, because it forces the thing you're 
accessing to know how to create the thing you want.  That seems like 
a dependency inversion.

It seems like what we really want is to be able to use the 
ObjectRoles package to define various kinds of routers and sets as 
*aspects* of a factbase.  Except that we only want them to stay 
around if they're in use, so we need a kind of "weak role".  This 
would be straightforward to add to the ObjectRoles package, so that 
we could then do something like:

     class FilteringRouter(roles.WeakRole, trellis.Component):
         def __init__(self, parent, keyfunc, **kw):
             trellis.Component.__init__(parent=parent, keyfunc=keyfunc, **kw)

         # ...all the previous router code

Ugh.  It still needs a cache, because the cell needs to know which 
spokes to send the data to.  At least now we can attach 
FilteringRouter instances to the FilteredSet instances, e.g.:

     FilteringRouter(FilteringRouter(factbase, type)[someType], get_key_attr)

Except maybe we should just call these things "splitters", i.e.:

     records = Splitter(Splitter(factbase, type)[someType], get_key_attr)[key]

This makes it a little clearer what's going on.  It also shows why I 
wanted to use a factbase cache, so you could use a key to go straight 
to the desired set, without needing to use hairy access paths like 
this one every time you need the set.  It certainly seems reasonable 
to make splitters be WeakRoles, though, since that lets you attach as 
many of them as you like to a set.

At least the only thing that needs any special handling for 
__getitem__ is the factbase's "makeSet" method.  That way, a key like 
'someRecordType.someAttr' can be mapped to 
"Splitter(Splitter(factbase, type)[someRecordType], 
someAttr.__get__)", which means that 
'factbase[someRecordType.someAttr][attrVal]' would return a set of 
the records with that attribute value, with automatic updating.

Of course, it's possible that the set for a particular record type 
isn't actually going to be produced via a type-based splitter; it may 
instead be a database-backed set of some type.  So, there will 
probably need to be splitters beyond just the simple filter-based 
type I've outlined here.

It's also possible that making the entire factbase have 
add()/remove() methods and added/removed events is a bad idea, versus 
separating things by record type from the very beginning.  If a 
particular set of record types are all mapped to a single DB backend, 
that backend can do something like that, in order to log changes that 
need to be flushed to SQL or disk or whatever.  But it's not 
necessary to do that across the entire factbase.

Hm.  Actually, I suppose there *is* a benefit, which is that you can 
also implement global undo by logging all changes.  Okay, I guess I'm sold.

Iteration and containment tests for sets are a bit vague at this 
point, but if splitters didn't directly create a FilteredSet, and 
instead asked the parent set to create a filtered set, then this 
would allow DB-backed sets to provide sets that do queries (or apply 
filters) when you try to iterate or membership-test.

Hm.  This is starting to sound almost workable.  :)  Queries with 
live updates, roll-forward and undo logging, multiple backends (or no 
backends!), with support for arbitrary Python transform and filter 
operations.  Nice.

Still a lot of bits to sort out, since this is all purely the "flat 
records" level, rather than the ORM level.  Same general principles 
apply, except that it's virtually impossible to do any of this fancy 
stuff on the object level.  You basically want to convert to objects 
only *after* all of the actual query work is done, because then you 
don't have to deal with the two-levels-of-updating problem that 
object-based (as opposed to tuple-based) sets have.

So, in the long run, this will need a way to specify tuple-level 
queries using object-level abstractions.  Oh, and joins.  We'll 
definitely need joins.  They shouldn't be difficult to specify, 
however, using appropriate Splitters for their bases.

At that point, we will have what is effectively a Rete network 
implementation -- i.e., a forward-chaining rules engine, with 
database backing.  Not bad at all.