[PEAK] TrellisDB basics and imperative programming in the trellis
Phillip J. Eby
pje at telecommunity.com
Tue Aug 21 16:22:51 EDT 2007
[This is a post I was working on last Monday before my PC crashed;
it's slightly out of date relative to my current thinking on the new
algorithm, but it does cover some basic points. Throughout, I'm
commenting in brackets where there are changes since last week.]
In previous posts, I've laid out how TrellisDB will likely work for
queries, using various straightforward ways of mapping changes to
sets into changes on subsets, joined sets, etc.
However, there are some other things that TrellisDB will need to do,
like defining quasi-relational schemas (similar to the EIM model I
created for Chandler), and mapping Component changes to relational changes.
When a component is requested from the DB, it needs to be set up in
such a way that its changes are propagated back as EIM-like
diffs. That is, as a specification of what records to add, modify,
or delete. These changes can then propagate through dependent sets,
and be accumulated in backend write logs to ensure that backends get
updated before the next time they are queried, or whenever an
explicit flush is requested. [08/21 - I'm no longer thinking that
EIM-like diffs are necessarily the best way to propagate this
information; individual backends can simply map set membership
operations onto whatever mechanism is appropriate for the backend.]
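For concreteness, here's a minimal sketch of what the EIM-like diff idea
above amounts to; the function and the tuple layout are made up purely
for illustration, and aren't anything from TrellisDB or EIM:

    def component_diff(old_records, new_records):
        """Turn two {key: fields} snapshots into add/modify/delete ops."""
        changes = []
        for key, fields in new_records.items():
            if key not in old_records:
                changes.append(('add', key, fields))
            elif old_records[key] != fields:
                changes.append(('modify', key, fields))
        for key in old_records:
            if key not in new_records:
                changes.append(('delete', key, None))
        return changes

    # A batch like [('modify', 42, {'email': '...'}), ('delete', 7, None)]
    # would then flow through dependent sets and per-backend write logs.
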
In order for these changes to be processed in "real time" (i.e. a
single recalc sweep), there needs to be a way to combine changes from
multiple sources -- the classic "hub" issue.
One of the goals of the priority-heap recalculation algorithm is to
allow hubs to receive information that is "pushed" to them in some
way, without needing to explicitly iterate all possible inputs.
A side effect of this approach is that it's necessary to roll back
part of the calculation if a "layer inversion" occurs -- if a rule
reads a higher-layered cell than itself. However, if we allow such
rollbacks, then it seems like it should be possible to do the
reverse: roll back modifications when a rule *writes* to a
*lower*-layered cell than itself!
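To make the two cases concrete, here's a tiny hypothetical sketch of the
checks involved; the Cell class and function names are invented for
illustration and aren't from peak.events.trellis:

    class Cell(object):
        def __init__(self, layer=0, calculated=False):
            self.layer = layer
            self.calculated = calculated

    class LayerInversion(Exception):
        """A rule touched a cell on the 'wrong' side of its own layer."""

    def check_read(rule_cell, read_cell):
        # Reading a higher-layered cell that hasn't been calculated yet
        # means the reader has to wait for (or be rescheduled after) it.
        if read_cell.layer > rule_cell.layer and not read_cell.calculated:
            raise LayerInversion("read of an uncalculated higher-layer cell")

    def check_write(rule_cell, target_cell):
        # Writing to a lower-layered cell that has already been calculated
        # would invalidate finished work, so it forces a rollback.
        if target_cell.layer < rule_cell.layer and target_cell.calculated:
            raise LayerInversion("write to a calculated lower-layer cell")
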
That is, we could perhaps make it possible for trellis rules to write
to cells without needing to be in @action rules, after all. This
would simplify a lot of things, but make other things more difficult.
The simplifying part is that you could write everything in a
near-imperative style, as long as you can handle rollbacks. (You
still won't be able to "read the future", though). The difficult
part is that every data structure type will need to handle
rollbacks! This could be simplified a bit through appropriate undo
(or maybe redo) support, though.
The other difficult part is circularity. For reads, circularity
bottoms out by reading the previous value of a cell. For writes, the
naive approach of simply bumping your target's layer and rolling back
to where it was read, would lead to an infinite loop in the case of
circular writes. There'd need to be some way to detect the cycle and
halt it, like counting the number of up-leveling writes and bailing
out at some arbitrary limit. (Probably the simplest way to do this
is that whenever you raise a cell's layer number by writing to it,
you also renumber all the cells that are observing it,
recursively. If in this process you find yourself back at the
writing cell, you know you have a circular writer.)
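Roughly, that renumbering-based check could look like this (hypothetical
names and a deliberately bare-bones Cell, not the actual implementation):

    class Cell(object):
        def __init__(self, layer=0):
            self.layer = layer
            self.observers = []     # cells whose rules read this cell

    class CircularWrite(Exception):
        pass

    def raise_layer(cell, new_layer, writer, _seen=None):
        """Raise 'cell' to at least 'new_layer', renumbering observers
        recursively."""
        if _seen is None:
            _seen = set()
        if cell is writer:
            # We've come back around to the cell doing the writing: a cycle.
            raise CircularWrite("circular write detected")
        if cell in _seen or cell.layer >= new_layer:
            return
        _seen.add(cell)
        cell.layer = new_layer
        for observer in cell.observers:
            raise_layer(observer, new_layer + 1, writer, _seen)

    # After 'writer' writes to 'target':
    #     raise_layer(target, writer.layer + 1, writer)
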
The last difficult part is write-conflict management. To support a
truly imperative style, we would need to handle changes by multiple
rules, but we want to maintain order independence. A simple way to
keep order independence would be to logically "lock" a written cell
so that it's "owned" by the rule that writes it; writing to that cell
from anywhere else would then be an error.
But that approach disallows hubs -- the main thing that I'm trying to
support! If you have a shared data structure like a set or
dictionary, it should be possible for multiple rules to add or remove
items, provided that each one sees the same effective "start" state,
and that none of them change anything the others have read, or vice versa.
Supporting this is certainly *possible*, but it has the downside of
making custom data structures even harder to write. You need to be
able to define what a serializable history is for a given data
type. More precisely, you must be able to tell whether a particular
collection of read/write operations is order-independent with respect
to any other collection of read/write operations on that data
type. For example, if you only add items to a set, that is order
independent with respect to adding other items or even the same
items. Removing an item, however, is only order independent relative
to removing *different* items from the set.
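As a toy example of the sort of test a data type would have to supply,
something along these lines could decide whether two rules' batches of
set operations commute (the function and the op encoding are invented
here, following the add/remove rules just described):

    def set_ops_commute(ops_a, ops_b):
        """Each argument is a list of ('add', item) or ('remove', item)
        pairs."""
        for op_a, item_a in ops_a:
            for op_b, item_b in ops_b:
                if op_a == 'add' and op_b == 'add':
                    continue      # adds always commute, even of the same item
                if item_a == item_b:
                    return False  # add/remove or remove/remove of one item
        return True

    assert set_ops_commute([('add', 1)], [('add', 1)])
    assert set_ops_commute([('remove', 1)], [('remove', 2)])
    assert not set_ops_commute([('add', 1)], [('remove', 1)])
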
A brute force way of determining this would be to undo and redo every
rule that targets the same cell(s) (so as to complete them in more
than one order), then see if the results are any different. But I
don't think that's really going to work very well in
practice. :) It's also very inefficient, since in a
properly-written Trellis, the code should be correct. Ideally, we
want the lowest possible overhead for correct code.
It seems that the best tradeoff here is to simply require that (in
the general case) only one rule is allowed to write to a given data
structure during a single recalculation. This restriction would be
loosened for an explicitly-defined "hub" cell type, which would be
allowed to accumulate input from multiple "spoke" cells. The output
of a hub cell would simply be a list of the values provided by any
changed "spoke" rules. You could then write other rules that would
use the hub output to perform some serialized application of the data
that was thus accumulated -- including changes to other data
structures, as long as that rule is the only one that writes to them.
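A bare-bones sketch of the hub idea (none of these names are real
Trellis API, and the pulse bookkeeping is hand-waved):

    class Hub(object):
        """Accumulates values pushed by 'spoke' rules during one pulse."""

        def __init__(self):
            self._incoming = []     # values pushed by spokes this pulse
            self.output = []        # what downstream rules see for the pulse

        def push(self, value):      # called from any number of spoke rules
            self._incoming.append(value)

        def finish_pulse(self):     # called by the (hypothetical) machinery
            self.output, self._incoming = self._incoming, []

    # A single downstream rule can then apply hub.output serially -- e.g.
    # adding each accumulated item to a shared set -- provided it's the
    # only rule that writes to that set.
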
To borrow an idea from concurrent programming, the only sane way to
write a program that contains non-determinism is to start with a
deterministic system and add non-determinism as necessary. Hubs
allow us to introduce a tiny bit of explicit non-determinism in a
controlled way. Of course, they're not *really* non-deterministic
within a given program run; it's just that if you add or remove
rules, the order of results could change. We could, however, add
some sort of "testing" switch you could use to cause hub outputs to
get randomly shuffled, in case you want to prove your code is truly
order-independent. :)
So, what would be the big differences between this new idea and the
0.5b1 algorithm?
1. Rules can be interrupted by a LayerChange exception if they read
from a cell in a higher layer (and that cell hasn't been calculated
yet), or write to a cell in a lower layer (and that cell has already
been calculated). A LayerChange exception during reading simply
reschedules the rule to wait until after the dependency has been
calculated, but one during writing forces a rollback of all
calculations (and writes!) back to the point just before the target
cell was first read. [08/21 - note that later in this post, we get
rid of the need for the exception; it suffices to re-run the rule
after an undo.]
2. @action would stay around, but you wouldn't need it just for a
rule that did some writing. In fact, @action rules would NOT be able
to do writes, as that would be inherently circular. [08/21 - but we
could possibly loosen this to allow such writes to trigger a new
recalculation cycle, as mentioned toward the end of this post]
3. The current cell write conflict rules would remain in effect, but
there would be an additional way to conflict: two different rules
writing to the same "future" would be an error. [08/21 - some
details still need to be hammered out to implement this latter
aspect, as it requires some sort of additional API]
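For point #3, the extra check could be as simple as remembering who
wrote what during a pulse; the sketch below is purely illustrative,
since the actual API is exactly the part that isn't settled:

    class WriteConflict(Exception):
        pass

    class PulseWriteTracker(object):
        def __init__(self):
            self._writers = {}    # target cell/future -> rule that wrote it

        def record_write(self, target, rule):
            earlier = self._writers.setdefault(target, rule)
            if earlier is not rule:
                raise WriteConflict(
                    "two rules wrote the same target this pulse")

        def new_pulse(self):
            self._writers.clear()
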
I really wish there were an easy way to avoid point #1 -- especially
for writes, which will require some expensive machinery to roll
back. (Bailing out of a read is trivial by comparison.) However, as
best I can tell, avoiding the bailout/rollback requires that you know
in advance which rules are going to read or write what. In the
"normal" cases, these errors won't happen because the shape of the
trellis will quickly stabilize to reflect the true relative
priorities of the cells involved, as long as the program as a whole is acyclic.
I suppose that the plus side is that with these conditions, *any*
error can be rolled back, all the way to the point of its origin,
preserving the trellis' data integrity 100%. Data structure code
only has to guarantee reversibility; the Trellis will take care of
calling all of the rollback functions if an untrapped error occurs in
a @modifier.
Recalculations will have to track all the cells recalculated along
with the undo log, so that the undo log can be unwound all the way to
the point where an "early read" (reading of a cell that would be
written to later) occurred.
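Concretely, that bookkeeping might amount to something like the
following (hypothetical sketch; a real undo log would presumably record
more than bare callbacks):

    class UndoLog(object):
        """Undo entries plus a record of where each cell was first read."""

        def __init__(self):
            self._entries = []      # undo callbacks, in the order recorded
            self._first_read = {}   # cell -> log length at its first read

        def note_read(self, cell):
            self._first_read.setdefault(cell, len(self._entries))

        def record(self, undo_callback):
            self._entries.append(undo_callback)

        def unwind_to_first_read(self, cell):
            """Undo everything done since 'cell' was first read."""
            cut = self._first_read.get(cell, 0)
            while len(self._entries) > cut:
                self._entries.pop()()
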
One interesting side-effect of this algorithm is that it gets rid of
some special-casing that occurs in the current
implementation. Specifically, the cell-initialization bugs that bit
Grant more than once. In the current implementation, we special case
the first time a cell is set, so that the value appears in that cell
immediately. There are various tricky aspects to doing this right,
including how to handle multiple writes before the cell is read.
However, in the new algorithm, writing a cell *always* has immediate
effect, and it's *always* safe to write multiple values to a cell, as
long as no one else has read it and the writes all take place in the
same rule. The current implementation makes these things happen only
for new cells; under the new algorithm, they should naturally fall
out of the "normal" way of doing things.
[08/21 - in discussion last week with Grant, it works out that we
also need for writes to a cell to always take precedence over the
rule's calculation, if both a write and recalc happen in the same
pulse. The rule must still *run* in order to update dependencies,
but its new value must be thrown out, and it must not have any side
effects. That is, the rule mustn't cause anything to be written to
the cell's undo log.]
To verify that, I'll need to review the algorithm in detail,
though. Heck, I need to review it in detail anyway, as the above is
more of a sketch of an idea than it is an actual algorithm.
The new idea is essentially that writes take place immediately, and
reads from a rule/cell that's doing the writing don't conflict. You
simply can't write to a cell if it has already been read in the
current pulse by some *other* cell. If that happens, we force a
rollback to the point where it was read, after elevating its layer to
be higher than the writer's layer. We also check to make sure that
the target cell hasn't been read (directly or indirectly) by its
writer, since that's a circular dependency.
Aha! Better idea: instead of aborting rules with an error when
something happens that we don't like, we just let them return. The
Trellis can sort out the errors after the function returns. We just
track which cells the rule reads and writes, along with the undo log
for that rule. Instead of rolling back changes when a target has
already been read, we simply allow the target to *re*-notify its
observers, which will recalculate them again. We only use the undo
log when recalculating a rule that made changes, so that side-effects
don't happen twice.
Likewise, if we read a value that might be dirty (i.e., it's on a
higher layer), that's okay because we'll get recalculated if it
results in a change.
In other words, the recalculation algorithm is roughly:
* put a cell on the priority heap
* while the heap is non-empty, pop a cell and recalculate it, keeping
an undo log for writes, and tracking what cells it reads and writes.
* update the cell's layer to reflect the cells it has read
* if its value changed, put its observers on the priority heap (with
their layers adjusted higher, if necessary)
* if it wrote to any cells, put those cells' observers on the
priority heap (with their layers adjusted likewise)
Well, that's not quite it. We don't put a cell on the priority heap
if it's flagged dirty, and we flag it dirty when we change one of its
dependencies.
If we read a cell that's flagged dirty, we recalculate it, and apply
the same rules.
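Spelled out as code, the loop might look roughly like the sketch below.
This is only the shape of the idea, under a pile of assumptions: cells
are assumed to expose layer, observers, and a recalc() returning
(changed, cells_read, cells_written); stale heap entries from later
layer bumps are glossed over; and the "recalculate a dirty cell when
it's read" case isn't shown:

    import heapq
    from itertools import count

    def recalculate(start_cell):
        heap, dirty, order = [], set(), count()

        def schedule(cell):
            if cell not in dirty:           # skip cells already flagged dirty
                dirty.add(cell)
                heapq.heappush(heap, (cell.layer, next(order), cell))

        schedule(start_cell)
        while heap:
            _, _, cell = heapq.heappop(heap)
            dirty.discard(cell)

            # Recalculate, keeping an undo log for its writes and tracking
            # which cells the rule read and wrote (assumed interface).
            changed, reads, writes = cell.recalc()

            # The cell's layer must end up above everything it read.
            cell.layer = max([cell.layer] + [r.layer + 1 for r in reads])

            # A changed value means its observers need recalculating, on a
            # higher layer than the cell itself.
            if changed:
                for obs in cell.observers:
                    obs.layer = max(obs.layer, cell.layer + 1)
                    schedule(obs)

            # Writes to other cells likewise dirty those cells' observers.
            for target in writes:
                for obs in target.observers:
                    obs.layer = max(obs.layer, target.layer + 1)
                    schedule(obs)
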
A key difference from the current algorithm is that we can re-run the
same rule more than once, if necessary. @action rules still run only
once, though, as they'll be considered "layer infinity" (much the
same as they are now). We only re-run rules or undo their actions if
backtracking made a difference to the cells that those rules read.
To prevent infinite regress, we must disallow a cell from reading a
cell that it wrote, or writing a cell that it read, however
indirectly. Either that, or we need an arbitrary repetition limit
that counts how many times a rule has been recalculated. The latter
is probably less overhead and easier to implement than a more direct
way of detecting the problem. (We can't use Python's recursion
limit, since the algorithm is mostly non-recursive.)
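The repetition-limit approach is about as simple as it sounds, e.g.
(arbitrary cap and names):

    MAX_RERUNS = 100    # arbitrary; a real limit would be tunable

    class CircularityError(Exception):
        pass

    def note_rerun(run_counts, rule):
        """Count recalculations of 'rule' this sweep; bail past the cap."""
        run_counts[rule] = run_counts.get(rule, 0) + 1
        if run_counts[rule] > MAX_RERUNS:
            raise CircularityError(
                "rule re-run too many times; probable cycle")
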
One really interesting thing about where this algorithm has wound up
is that it's not much different from traditional event-driven
programming, *except* that external side-effects are
partitioned into @action rules, and the rest can be re-run as many
times as necessary to arrive at a consistent result, with
undo-ability for any intermediate results that are invalidated due to
a change in the overall graph structure.
Unfortunately, this is also a pretty major rework of the
algorithm. Fortunately, the algorithm is maybe only 20-25% of the
total peak.events.trellis module. All the high-level APIs could
remain largely untouched. The data structures (List, Set, Dict)
might also need some rework, but probably not much. We could keep
using the "future" mechanism to avoid needing an explicit "undo"
mechanism, since "future" effectively works by writing a value to a cell.
It would suffice to keep track of the value of the cell before it was
written, and if the rule is recalculated we simply restart it with
its previous "before" value. Similarly, when recalculating a rule
more than once, we want to restore that rule's previous "before"
value as well (so that self-referential rules remain consistent).
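A sketch of that "before value" bookkeeping (again, hypothetical names,
not the actual "future" machinery):

    class BeforeValues(object):
        """Remember each cell's value from before its first write in a
        pulse."""

        def __init__(self):
            self._saved = {}

        def note_write(self, cell, old_value):
            self._saved.setdefault(cell, old_value)  # keep only the earliest

        def restart_value(self, cell, current_value):
            # When re-running a rule, start from the pre-write value if any.
            return self._saved.get(cell, current_value)

        def new_pulse(self):
            self._saved.clear()
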
Whew. It's a big and potentially messy change, but we get to have
hubs and (kind of) imperative code. At the moment, it also seems
quite possible that the new algorithm will use less memory *and*
processor time than the current version, although the devil is
definitely in the details. More precisely, it's likely that a lot
*more* memory will be used during recalculation, but possibly quite a
bit less than the static memory requirements of the current
algorithm. And in non-pathological situations, it should do a lot
less dependency analysis.
I'll need to write another post to spell out the behavior
specifications in detail, though. This time I don't want any
under-specified corner cases popping up. So there will be a thorough
re-write of the current Specification.txt to describe the new
algorithm in excruciating detail. That re-write will also need to
cover the correct processing of receivers and discrete rules, as well
as how to handle poll() and repeat() properly in this new paradigm.
For multi-tasking, I had already given @action semantics to @task
methods. This is the right thing to do in the new algorithm simply
because we can't back a generator up a step! However, if we're
losing the ability to set values in @action rules, then we also lose
the ability to set them from generators, which potentially decreases
their usefulness. It may be that we want to keep the current
semantics of doing writes in @action and @task rules, i.e., that this
triggers a new round of calculation after the current one is
finished. That will also relate to how we handle repeat(), since
that has a similar effect.