[PEAK] The new trellis algorithm
Phillip J. Eby
pje at telecommunity.com
Fri Nov 16 20:34:37 EST 2007
Overview and Goals
------------------
So, after doing some GUI work with the current trellis and wxPython,
it's become pretty apparent that for heavy-duty trellis work, you
*really* need to be able to write imperative code as well as functional code.
To do this, some changes are required to the "ground rules" of how
concurrency, conflicts, etc., are handled.
The goal is to make it so that you can -- to a certain extent -- make
changes in rules, such that the changes are seen *immediately*.
We can do this and preserve order independence, IF we can re-run
rules that have seen an inconsistent state, AND detect infinite
recursion. To do this, rules must not have external side effects,
unless they are @action rules -- or unless they can be safely undone.
For cells, the undo logging will be easy, but data structure changes
will need to be explicitly logged (usually by code inside the data
structure's @modifier methods). Data structure "undo" can also be
implemented as a snapshot or copy-on-write, because the trellis only
needs to do full rollbacks of any given cell or independent data structure.
We will still need conflict detection, but it needs to actually
expand beyond what is currently supported by the existing trellis
data structures. Currently, only setting cells to a value can
conflict. But in principle, setting the same key in a Dict to two
different values should be rejected, since this is an order
dependency/write conflict. Similarly, for one rule to add an item to
a Set, and another to remove it, is also an order
dependency/conflict. (And similar issues apply to Lists, or really
any trellis-ized data structures.)
Internal and External Side-Effects
----------------------------------
@action and @task rules are also an interesting case, in that they
cannot be rolled back. This introduces a rather peculiar problem,
since we must defer
any changes that they make to a subsequent recalculation, while still
detecting conflicts. (And ideally reporting them with context
information in the traceback.)
After bouncing some ideas off of Grant today, I realized that the key
to fixing this is to change the way @action works. Currently,
@action rules can have external side effects *and* trellis side
effects (with the latter deferred to a subsequent recalc). But, in
the new algorithm, we need something that can have external side
effects and NO trellis side effects (since those could imply
recalculations being needed). I've chosen to call this new thing an
@observer, both to mirror the terminology of the original Lisp Cells
package and to suggest the external, passive nature of the rule.
That leaves what to do with @tasks. In the current algorithm, a
resumed task is synchronous, in the sense that it cannot "miss"
anything it's waiting for while it's "sleeping". However, under the
new algorithm, a task would not be guaranteed to awaken immediately
after the conditions for resumption are met. It would be possible
for other tasks to be resumed "first", because under this algorithm,
tasks would actually be queued and run as if they were executing as
independent tasks *outside* the trellis. In other words, @tasks are
non-trellis code that get resumed as a side effect of things
happening inside the trellis.
Now, we still don't want code inside of @rules or @tasks to see the
effects of changes they make, in order to avoid dependency cycles and
other paradoxes. However, from a practical perspective, they *will*
be able to see the *direct* effects of changes. If you set a cell to
a value within a rule and then read it, you will see the
change. This may also work for simple Set/Dict/List types, meaning
we could perhaps allow .setdefault(), .pop(), etc. to work on the
simple types. (modulo any concerns re: conflict detection and undo)
However, for data structures that use any sort of rules to maintain
their state, the rules may not have fired yet, and so the apparent
state may not be consistent.
So, to be safe, a rule should generally make all its changes after
all its calculations are complete, since it cannot rely on what
should happen "after" it makes changes.
Side-Effects Summary
--------------------
So, a quick summary of how side-effects would be handled, in all five
possible contexts:
@rule:
    Can see direct effects of their own changes, but no indirect
    effects. Best not to look at what you've changed, unless you know
    for sure you're using a simple data structure with no internal rules.

@observer:
    Cannot make changes to trellis state: it's an error if you
    try. Doesn't matter if it happens directly or indirectly. (But you
    can always work around it by moving the relevant code to a @rule!)

@task:
    Can see direct effects of its own changes, but no indirect
    effects. Best not to look at anything after the changes without
    doing a "yield" first.

@modifier:
    Is the thing actually making changes, so it better know what it's
    doing. :) Only direct side effects are visible.

Non-trellis code:
    Everything looks normal, no special rules or conditions to follow.
Of all of these contexts, only @modifier and non-trellis code are
immune from dependency tracking. @rule, @observer, and @task code
has dependencies tracked for recalculation/re-invocation purposes.
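To make the distinctions concrete, here's a minimal sketch of the
intended semantics. The decorators here are inert stand-ins (the real
ones would do dependency tracking and scheduling), and the cell
attributes are hypothetical:

    # Stand-in decorators, for illustration only
    def rule(func): return func
    def observer(func): return func
    def modifier(func): return func

    class Totals(object):
        @rule
        def update(self):
            total = self.a + self.b       # reads: dependencies tracked
            self.total = total            # a direct change...
            assert self.total == total    # ...is visible to this rule,
            # but rules *derived* from self.total may not have fired
            # yet, so don't rely on indirect effects here

        @observer
        def report(self):
            print("total is now", self.total)  # reading is fine
            # self.total = 0  # would be an error: observers may not
            # change trellis state, directly or indirectly

        @modifier
        def reset(self, value=0):
            self.total = value   # modifiers make changes; only their
                                 # direct effects are visible to them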
Cycle Detection
---------------
One consequence of allowing rules to have side effects is that it's
possible to have cycles, where a rule changes values that it
(directly or indirectly) depends on. This causes the rule to need to
be rerun... and rerun... and rerun...
In the current algorithm, we handle read cycles by noticing when we
are reading something in the middle of calculating it, and simply
returning the old value of the cell. This works great in the current
system because it allows rules to use their old values.
In the new algorithm, we could handle similar situations by noticing
when a rule writes to a cell it has read. This is logically exactly
the same as returning a new value after reading the old, so we can
handle it similarly: when we are about to mark a rule dirty, we can
check whether we are doing so because of a change to a cell that the
rule itself has written directly. If so, we can simply ignore the
change within the current recalculation.
With this innovation, we gain the ability to write simpler rules when
multiple inputs and multiple outputs are involved -- so long as only
one rule ever writes to those outputs.
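For instance, a hypothetical clamping rule (again with a stand-in
decorator) could read a cell and write it back without re-triggering
itself:

    def rule(func): return func   # stand-in decorator, as before

    class Clamped(object):
        @rule
        def clamp(self):
            v = self.value            # read: dependency recorded
            if v < self.minimum:
                self.value = self.minimum   # writing a cell this rule
            elif v > self.maximum:          # read is treated like
                self.value = self.maximum   # returning a new value, so
            # the change doesn't re-dirty this rule; any *other* rule
            # reading self.value is recalculated normally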
Other cases aren't so easy to handle. What happens if a rule depends
*indirectly* on a value it changes? That situation can't be resolved
by "reading the old value", so it has to result in an error.
We can detect this, I think, by keeping track of *why* we are
recalculating a cell, in the sense of knowing what other cell changed
and triggered the recalculation. When recalculating a cell that has
already been calculated once in the current pulse, we can backtrack
the links to see if the cell depends on itself, and raise an error.
In fact, if we actually use a layering system and priority queue (as
I've described in previous posts), we need only perform the loop
check when a change is propagated to a cell with a *lower* layer
number than the cell that changed. Under normal circumstances,
propagation is always to cells with greater-or-equal layer numbers,
so only cycles and the introduction of new side effects into the
graph would incur the need for a cycle check.
This does raise the question of what layer a written-to cell should
occupy, in the case of a rule that reads and writes the same
cell. If layer changing takes place at dirty-notification, then the
written-to cell will move above the corresponding rule, and the rule
will remain below it. However, if another rule reads the same cell,
it will be forced above the first rule.
If that rule then proceeds to write to the same cell, the cell moves
up again, and the rule that wrote it remains in place (since it
doesn't get a dirty notice from its own change).
At this point, the changed cell notifies the first rule, and since
the notice is going from a higher cell to a lower one, we check the
propagation path and detect a cycle.
Hm. Of course, this would only happen if the two rules wrote
*different* values to the cell, which would ordinarily cause a
conflict. If the values were the same, then the second rule writing
to the cell could not cause the first rule to be re-run, and there is
no order conflict. That is, it doesn't matter what order the two
rules ran in, as the same result is produced regardless.
Conflict Detection
------------------
As mentioned above, we need to support fine-grained conflict
detection for data structures like Dict, Set, and List, such that
multiple rules can't mess with the data.
In the simplest case, we could disallow changes to the same structure
by multiple rules at the same time. This is pretty easy to do, since
we simply track that the entire object has changed (by way of
undo/change notification from the object): the trellis can thus just
detect the conflict and raise an error.
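A minimal sketch of that coarse check, assuming the trellis knows
which rule is currently running (names hypothetical):

    class ConflictError(Exception):
        pass

    _changed_by = {}   # id(structure) -> rule that changed it this recalc

    def note_change(structure, current_rule):
        # Called from a structure's change/undo notification hook.
        prior = _changed_by.setdefault(id(structure), current_rule)
        if prior is not current_rule:
            raise ConflictError(
                "two rules changed %r in one recalculation" % (structure,))

    def end_recalc():
        _changed_by.clear()   # reset the bookkeeping between recalcs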
But this is undesirable for data structures that *want* to be shared,
and do not expect any conflicts under normal circumstances. For
example, suppose you are displaying a grid of data about various
records. You'd like to assemble a summary of changed fields from the
visible records, so that you can update the display. In the current
trellis implementation, you must loop over *all* fields to generate a
list, dict, set, or other collection of changed fields.
But with the new side-effect capability, you could have individual
rules that write to a single, shared data structure, whenever the
fields change. Since each rule is contributing to only one piece of
the whole, there is no way they can conflict. Thus, you could have a
rule that simply reads an efficiently-generated collection of changes.
To make this work, we either need the trellis to support fine-grained
conflict detection using keys of some kind, *or* we need data
structures to do their own conflict detection. After all, if you
*know* that a conflict can't happen, why incur the extra
overhead? Likewise, if you know that only one rule should be writing
to the data structure, then it makes sense to save the storage and
computational overhead of doing fine-grained tracking.
So, this suggests that the trellis should do coarse locking (only one
changer of a data structure allowed) by default, and provide a means
whereby data structures can do fine-grained locking, or even no locking.
In fact, it's not clear that it's the trellis, per se, that should do
the conflict detection at all. Cell objects can easily do their own,
and so can any other data structure.
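For instance, a shared dictionary that opts into fine-grained checking
might track writers per key, along these lines (a sketch; the trellis
hookup and undo logging are elided):

    class ConflictError(Exception):
        pass

    class SharedDict(dict):
        # Sketch: per-key conflict checking; trellis hookup elided.

        def __init__(self, *args, **kw):
            dict.__init__(self, *args, **kw)
            self._writers = {}   # key -> (rule, value) for this recalc

        def set(self, key, value, rule):
            # 'rule' stands in for however the trellis identifies the
            # currently-running rule.
            if key in self._writers:
                prior_rule, prior_value = self._writers[key]
                if prior_rule is not rule and prior_value != value:
                    # Two rules, two different values: order conflict
                    raise ConflictError("conflicting writes to %r" % (key,))
            self._writers[key] = (rule, value)
            dict.__setitem__(self, key, value)

        def commit(self):
            self._writers.clear()   # called at end of recalculation

Note that two rules writing the *same* value to a key is allowed here,
matching the observation above that equal writes create no order conflict.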
Rollback, Commit, and Change Notices
------------------------------------
It's up to an individual data structure to decide what "changed"
means. For most data structures we have now, change is an atomic
thing. Right now, if you change the value of one key in a
dictionary, the whole thing is considered "changed". So even if you
have a rule that only watches one key, it will be recalculated a lot.
Of course, if you track changes to individual keys, then there are a
lot more dependencies to record, more allocation and garbage
collection, and so on. So fine-grained tracking could also be quite expensive.
It seems the reasonable thing to do here is to let data structures
decide these matters. But *how*?
Cells currently use equality testing on a value to decide what has
"changed". This doesn't work well for mutable data structures,
because copying is required for automatic change detection, or else
you have to trigger changes manually anyway.
So, it seems that there needs to be an abstract sort of cell, that
doesn't do any comparisons, but just records read/change operations,
and can be linked to a recalculation/undo callback.
The tricky bits of this involve issues like managing reference
cycles, and the actual mechanism for handling rollback/commit
operations (which have to be handled by the data structure, but are
invoked by the Trellis).
It seems to me that there are some connections between change
notification, undo, conflict detection, and cycle detection that I'm
not seeing clearly yet. It seems like there should be some simple
and powerful way to do all of these things together, but I don't see
what it is yet.
For example, change==undo. If you change something, you have to be
able to undo it. If you have to undo something, then something has
changed. However, in the model described thus far, the missing
ingredient is saying *what* has changed.
It's not needed for undo, since one can simply roll back the entire
undo log to a desired point in the recalculation. But in order to
notify rules that things have *changed*, we have to know "what"
changed, in order to know if it's the same "what" that some rule
previously read.
It seems as though we could have some sort of "node" objects with an
interface like:

    node.used()                     -- record that this node was read
    node.changed(undo_func, *args)  -- notify about changes, record an undo action
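A bare-bones version of that interface might look like this (the
current-rule and undo-log globals, and the schedule() hook, are
placeholders for whatever the trellis actually uses):

    import weakref

    _current_rule = None   # set by the trellis while a rule runs
    _undo_log = []         # the recalculation's shared undo log

    def schedule(rule):
        """Placeholder: queue 'rule' on the priority heap (see below)."""

    class Node(object):
        def __init__(self):
            self.readers = weakref.WeakSet()   # rules that read us

        def used(self):
            # Record that the currently-running rule read this node.
            if _current_rule is not None:
                self.readers.add(_current_rule)

        def changed(self, undo_func, *args):
            # Log how to undo the change, then dirty our readers.
            _undo_log.append((undo_func, args))
            for rule in self.readers:
                schedule(rule)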
And if these changeable elements represent the nodes of our
dependency graph, then we also need *edges*: the rules.
(Well, really, rules aren't graph edges per se. Instead, the
dependencies are edges, and groups of edges are effectively labelled by rules.)
Anyway, if we track dependencies forward and backward, so that each
node knows who it depends on -- and what rule was used for that
dependency -- as well as who depends on it (via weak references),
then it should be possible to do the whole thing without rules
needing to be special objects in any way. It should be sufficient to
invoke a rule via the trellis *once*, and it will live as long as any
nodes exist that were changed by it.
However, if we take this approach to rules, there is a bit of a
caveat. A rule that produces *no* output (doesn't cause any changes
to fire) should remain a dependency of its former
targets. Otherwise, if a rule doesn't produce a changed value, it
will cease to be recalculated. However, a rule that uses no *inputs*
on a given run (no used() calls) can be safely discarded.
The Algorithm
-------------
Nodes have a layer number, initially zero. Nodes changed outside a
rule have their layer reset to zero. Changed nodes have their layer
number elevated to be at least 1 greater than the highest layer of
any node read (other than themselves) by a rule that changes them.
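In code, the layer bump might amount to something like this (a sketch;
note that later in this post, layer numbers end up belonging to rules
rather than nodes, but the arithmetic is the same):

    def bump_layer(changed_node, nodes_read):
        # The changed node's layer must exceed that of every node the
        # changing rule read (other than the changed node itself).
        highest = max((n.layer for n in nodes_read
                       if n is not changed_node), default=-1)
        changed_node.layer = max(changed_node.layer, highest + 1)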
Nodes remember (strongly) the input nodes that they depend on, and
via what rule. They also weakly reference the output nodes that
depend on them. When they change(), the nodes that depend on them
are queued for updating, once their new layer numbers are known. A
node never depends directly upon itself.
While a rule is running, the used() nodes are tracked, so they can be
added to the dependencies of all the changed() nodes the rule touches.
During recalculation, a priority heap of nodes requiring update is
kept. Whenever a node is marked dirty by a change to one of its
inputs after it has already been validated during the same
recalculation, the undo log must be rolled back to the point where it
was previously calculated, after updating the
node's layer number. (Updating the layer number changes the order in
which the new attempt at recalculation will proceed, so that the same
thing doesn't happen again.)
It's possible that a more efficient rollback could happen, in that
only the nodes which depended on the dirty node really *need* to be
rolled back. But it may be a lot simpler, code-wise, to just keep a
giant linear undo log and roll it back the same way. (Especially
since the undo log might be used to track changes to nodes' dependencies.)
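So a simple linear log with savepoints may be all that's needed;
something like:

    class UndoLog(object):
        def __init__(self):
            self._log = []   # (undo_func, args) entries, oldest first

        def record(self, undo_func, *args):
            # e.g. before a cell write:
            #     log.record(setattr, cell, '_value', old_value)
            self._log.append((undo_func, args))

        def savepoint(self):
            return len(self._log)   # a position to roll back to later

        def rollback_to(self, point):
            # Undo entries newest-first, back to the savepoint.
            while len(self._log) > point:
                undo_func, args = self._log.pop()
                undo_func(*args)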
When a node is marked dirty, the trellis will note what node marked
it so, and add it to the heap using its (possibly increased) layer
number as a priority. When the node is reached in the heap, its
dependencies are scanned to identify all of the rules that read any
of the dependencies that marked the node dirty. These rules are then
re-run, with the new dependency links replacing the old.
Well, that's not quite right. The catch is that we want to re-run a
rule only once. And because of that, it's not the nodes that have
layer numbers -- it's the rules.
Why? Because it's the rules that need ordering, not the
nodes. You're going to re-run a rule as soon as you reach the lowest
layer of any of its outputs, so it's a waste to keep track of layer
numbers for the nodes.
So, if we treat any outside-the-trellis changes as being done by a
rule at layer 0, things make a bit more sense. Any rule that depends
on a value changed by that layer 0 rule then has its layer bumped to
at least 1, and so on.
So in terms of the graph structure, we wrap rules in objects that
keep the layer info, and nodes can weakly reference the rule-objects
that read them, and strongly reference the ones that write
them. Rules in turn weakly reference their output nodes, and
strongly reference their input nodes. (This is somewhat unfortunate
in that it seems to more than double the memory requirements of the
current trellis! At least, unless I come up with a more clever way
to store the links.)
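In sketch form, with WeakSet standing in for whatever link storage
turns out to be most compact:

    import weakref

    class RuleInfo(object):
        # Wraps a rule function with its layer and dependency links.
        def __init__(self, func, layer=0):
            self.func = func
            self.layer = layer
            self.inputs = set()               # strong: nodes we read
            self.outputs = weakref.WeakSet()  # weak: nodes we write

    class Node(object):
        def __init__(self):
            self.writers = set()              # strong: rules that write us
            self.readers = weakref.WeakSet()  # weak: rules that read us

The strong reference from a node to its writers is what keeps a rule
alive exactly as long as some node it changed is alive, as described above.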
Observers are simply rules that live at "layer infinity". When the
priority heap reaches this layer, it continues running rules and
updating dependency links, but changes are no longer allowed, and
calling any node.changed() should trigger an error and a complete
rollback of the recalculation. In fact, any unexpected errors during
recalculation should result in rollback.
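The driving loop is then roughly a priority heap keyed by layer, with
observers pinned at infinity (a sketch; run_rule() is a placeholder):

    import heapq

    OBSERVER_LAYER = float('inf')

    def run_rule(rule, changes_allowed):
        """Placeholder: re-run 'rule', tracking used()/changed() calls;
        changed() raises if changes_allowed is False."""

    def recalculate(heap, undo_log):
        # heap entries: (layer, seq, rule); seq breaks layer ties
        while heap:
            layer, seq, rule = heapq.heappop(heap)
            try:
                run_rule(rule, changes_allowed=(layer < OBSERVER_LAYER))
            except Exception:
                undo_log.rollback_to(0)   # any error: full rollback
                raise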
Cycle detection occurs when an already-calculated rule is
recalculated. Whenever a rule is marked for calculation, we keep a
record of what rule(s) caused it to be recalculated, and we can then
use that record to backtrack and check for a cycle. If there is one,
we roll the whole thing back and raise an error -- *and* we can
include all the involved rules in the error, for debugging!
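Backtracking over those records is a simple walk; roughly (with
'caused_by' as a hypothetical map from each rule to the rule whose
change scheduled it):

    class CycleError(Exception):
        pass

    def check_for_cycle(rule, caused_by):
        # Walk back through the chain of causes; if we reach 'rule'
        # again, it's a cycle, and 'path' holds every involved rule.
        path = [rule]
        current = caused_by.get(rule)
        while current is not None and current not in path:
            path.append(current)
            current = caused_by.get(current)
        if current is rule:
            raise CycleError("dependency cycle among rules: %r" % (path,))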
To implement tasks, we also need the ability to post callbacks to be
run after the observers are finished, so that tasks effectively run
"outside" the trellis. (We don't want to use EventLoop.call() for
this, though, because tasks need to be able to work without an active
event loop.)
Having this post-commit callback ability also allows us to handle
discrete events/receiver cells: we can register a @modifier that
resets them all to their default values!
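A sketch of that hook (names hypothetical):

    _post_commit = []   # callbacks to run once recalculation commits

    def on_commit(func, *args):
        _post_commit.append((func, args))

    def run_commit_hooks():
        # Runs after the observers finish; callbacks execute *outside*
        # the trellis, so they may freely start new recalculations.
        # (This is also how resumed @tasks would get their turn.)
        while _post_commit:
            func, args = _post_commit.pop(0)
            func(*args)

    # e.g., a receiver cell schedules its own reset when written:
    #     on_commit(receiver.reset_to_default)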
In this way, most of the built-in features of the current trellis
actually become pluggable, in the sense that they are all implemented
using a lower-level API. So Cells in the current model would be
implemented by combining a "node" with a place to store a value, and
some logic to decide when to mark the node .changed() or
.used(). For cells with rules, the rule execution would just get
wrapped by an API call, so that the rule winds up being tracked. Of
course, that rule would really be a method of the cell that wraps
the "real" rule with extra logic to set the cell's value to the
return value, and mark it changed() if appropriate.
Read operations on cells are simple: just read the current value and
mark the node .used(), unless the cell is new and has a rule (in
which case you run the rule at layer 0 and recalc first). Write
operations are just a @modifier wrapping a write, a conflict check,
and a .changed() call.
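Putting the pieces together, a value cell might be little more than
this (a sketch building on the Node above; the conflict check is
elided):

    class Cell(object):
        def __init__(self, value=None):
            self.node = Node()   # the Node sketched earlier
            self._value = value

        def get(self):
            self.node.used()     # record the read
            return self._value

        def set(self, value):
            # Conceptually wrapped in a @modifier; the conflict check
            # is elided here.
            if value != self._value:
                old = self._value
                self._value = value
                # Tell the node how to undo this write:
                self.node.changed(setattr, self, '_value', old)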
Ooh, one last thing: data structures with conflict detection need to
be able to register commit-time callbacks for cleanup purposes. But
this could possibly be handled by giving nodes a slot where conflict
data gets stored, and having all changed() nodes clear the slot at
the end of recalculation.
Wrap-Up
-------
I think this covers most everything that's needed to implement the
new approach, and throw open the gates to having a near-infinite
variety of trellis-enabled data structures. Designing something like
this is hard, because there are a huge number of minor variations and
special cases, even though the nominal conditions are quite simple.
Before actually going to implementation, though, I'm going to want to
vet this model against a variety of use cases and requirements,
especially ones reverse-engineered from the existing tests and
codebase. Then, I'll write a more detailed specification that's not
so stream-of-consciousness, and try to nail down what the new APIs
will look like. One nice side-effect of this design is that a lot of
features like 'ensure_recalculation()', 'dirty()', 'poll()' etc.
should be implementable using the new system's primitives, rather
than as filthy hacks of the system's innards.