[PEAK] The new trellis algorithm
Phillip J. Eby
pje at telecommunity.com
Fri Nov 16 20:34:37 EST 2007
Overview and Goals
------------------
So, after doing some GUI work with the current trellis and wxPython,
it's become pretty apparent that for heavy-duty trellis work, you
*really* need to be able to write imperative code as well as functional code.
To do this, some changes are required to the "ground rules" of how
concurrency, conflicts, etc., are handled.
The goal is to make it so that you can -- to a certain extent -- make
changes in rules, such that the changes are seen *immediately*.
We can do this and preserve order independence, IF we can re-run
rules that have seen an inconsistent state, AND detect infinite
recursion. To do this, rules must not have external side effects,
unless they are @action rules -- or unless they can be safely undone.
For cells, the undo logging will be easy, but data structure changes
will need to be explicitly logged (usually by code inside the data
structure's @modifier methods). Data structure "undo" can also be
implemented as a snapshot or copy-on-write, because the trellis only
needs to do full rollbacks of any given cell or independent data structure.
We will still need conflict detection, but it needs to actually
expand beyond what is currently supported by the existing trellis
data structures. Currently, only setting cells to a value can
conflict. But in principle, setting the same key in a Dict to two
different values should be rejected, since this is an order
dependency/write conflict. Similarly, for one rule to add an item to
a Set, and another to remove it, is also an order
dependency/conflict. (And similar issues apply to Lists, or really
any trellis-ized data structures.)
Internal and External Side-Effects
----------------------------------
@action and @task rules are also an interesting case, in that they
cannot be rolled back. This introduces a rather peculiar problem,
since we must defer
any changes that they make to a subsequent recalculation, while still
detecting conflicts. (And ideally reporting them with context
information in the traceback.)
After bouncing some ideas off of Grant today, I realized that the key
to fixing this is to change the way @action works. Currently,
@action rules can have external side effects *and* trellis side
effects (with the latter deferred to a subsequent recalc). But, in
the new algorithm, we need something that can have external side
effects and NO trellis side effects (since those could imply
recalculations being needed). I've chosen to call this new thing an
@observer, both to mirror the terminology of the original Lisp Cells
package and to suggest the external, passive nature of the rule.
That leaves what to do with @tasks. In the current algorithm, a
resumed task is synchronous, in the sense that it cannot "miss"
anything it's waiting for while it's "sleeping". However, under the
new algorithm, a task would not be guaranteed to awaken immediately
after the conditions for resumption are met. It would be possible
for other tasks to be resumed "first", because under this algorithm,
tasks would actually be queued and run as if they were executing as
independent tasks *outside* the trellis. In other words, @tasks are
non-trellis code that get resumed as a side effect of things
happening inside the trellis.
Now, we still don't want code inside of @rules or @tasks to see the
effects of changes they make, in order to avoid dependency cycles and
other paradoxes. However, from a practical perspective, they *will*
be able to see the *direct* effects of changes. If you set a cell to
a value within a rule and then read it, you will see the
change. This may also work for simple Set/Dict/List types, meaning
we could perhaps allow .setdefault(), .pop(), etc. to work on the
simple types. (modulo any concerns re: conflict detection and undo)
However, for data structures that use any sort of rules to maintain
their state, the rules may not have fired yet, and so the apparent
state may not be consistent.
So, to be safe, a rule should generally make all its changes after
all its calculations are complete, since it cannot rely on what
should happen "after" it makes changes.
Side-Effects Summary
--------------------
So, a quick summary of how side-effects would be handled, in all five
possible contexts:
@rule:
    Can see direct effects of their own changes, but no indirect
    effects. Best not to look at what you've changed, unless you know
    for sure you're using a simple data structure with no internal rules.

@observer:
    Cannot make changes to trellis state: it's an error if you
    try. Doesn't matter if it happens directly or indirectly. (But you
    can always work around it by moving the relevant code to a @rule!)

@task:
    Can see direct effects of its own changes, but no indirect
    effects. Best not to look at anything after the changes without
    doing a "yield" first.

@modifier:
    Is the thing actually making changes, so it better know what it's
    doing. :) Only direct side effects are visible.

Non-trellis code:
    Everything looks normal, no special rules or conditions to follow.
Of all of these contexts, only @modifier and non-trellis code are
immune from dependency tracking. @rule, @observer, and @task code
has dependencies tracked for recalculation/re-invocation purposes.
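To make the distinctions concrete, here's a minimal sketch of the
intended semantics. The decorators here are inert stand-ins (the real
ones would do dependency tracking and scheduling), and the cell
attributes are hypothetical:

    # Stand-in decorators, for illustration only
    def rule(func): return func
    def observer(func): return func
    def modifier(func): return func

    class Totals(object):
        @rule
        def update(self):
            total = self.a + self.b       # reads: dependencies tracked
            self.total = total            # a direct change...
            assert self.total == total    # ...is visible to this rule,
            # but rules *derived* from self.total may not have fired
            # yet, so don't rely on indirect effects here

        @observer
        def report(self):
            print("total is now", self.total)  # reading is fine
            # self.total = 0  # would be an error: observers may not
            # change trellis state, directly or indirectly

        @modifier
        def reset(self, value=0):
            self.total = value   # modifiers make changes; only their
                                 # direct effects are visible to them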
Cycle Detection
---------------
One consequence of allowing rules to have side effects is that it's
possible to have cycles, where a rule changes values that it
(directly or indirectly) depends on. This causes the rule to need to
be rerun... and rerun... and rerun...
In the current algorithm, we handle read cycles by noticing when we
are reading something in the middle of calculating it, and simply
returning the old value of the cell. This works great in the current
system because it allows rules to use their old values.
In the new algorithm, we could handle similar situations by noticing
when a rule writes to a cell it has read. This is logically exactly
the same as returning a new value after reading the old, so we can
handle it similarly: when we are about to mark a rule dirty, we can
check whether we are doing so because of a change to a cell that the
rule itself has written directly. If so, we can simply ignore the
change within the current recalculation.
With this innovation, we gain the ability to write simpler rules when
multiple inputs and multiple outputs are involved -- so long as only
one rule ever writes to those outputs.
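For instance, a hypothetical clamping rule (again with a stand-in
decorator) could read a cell and write it back without re-triggering
itself:

    def rule(func): return func   # stand-in decorator, as before

    class Clamped(object):
        @rule
        def clamp(self):
            v = self.value            # read: dependency recorded
            if v < self.minimum:
                self.value = self.minimum   # writing a cell this rule
            elif v > self.maximum:          # read is treated like
                self.value = self.maximum   # returning a new value, so
            # the change doesn't re-dirty this rule; any *other* rule
            # reading self.value is recalculated normally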
Other cases aren't so easy to handle. What happens if a rule depends
*indirectly* on a value it changes? That situation can't be resolved
by "reading the old value", so it has to result in an error.
We can detect this, I think, by keeping track of *why* we are
recalculating a cell, in the sense of knowing what other cell changed
and triggered the recalculation. When recalculating a cell that has
already been calculated once in the current pulse, we can backtrack
the links to see if the cell depends on itself, and raise an error.
In fact, if we actually use a layering system and priority queue (as
I've described in previous posts), we need only perform the loop
check when a change is propagated to a cell with a *lower* layer
number than the cell that changed. Under normal circumstances,
propagation is always to cells with greater-or-equal layer numbers,
so only cycles and the introduction of new side effects into the
graph would incur the need for a cycle check.
This does raise the question of what layer a written-to cell should
occupy, in the case of a rule that reads and writes the same
cell. If layer changing takes place at dirty-notification, then the
written-to cell will move above the corresponding rule, and the rule
will remain below it. However, if another rule reads the same cell,
it will be forced above the first rule.
If that rule then proceeds to write to the same cell, the cell moves
up again, and the rule that wrote it remains in place (since it
doesn't get a dirty notice from its own change).
At this point, the changed cell notifies the first rule, and since
the notice is going from a higher cell to a lower one, we check the
propagation path and detect a cycle.
Hm. Of course, this would only happen if the two rules wrote
*different* values to the cell, which would ordinarily cause a
conflict. If the values were the same, then the second rule writing
to the cell could not cause the first rule to be re-run, and there is
no order conflict. That is, it doesn't matter what order the two
rules ran in, as the same result is produced regardless.
Conflict Detection
------------------
As mentioned above, we need to support fine-grained conflict
detection for data structures like Dict, Set, and List, such that
multiple rules can't mess with the data.
In the simplest case, we could disallow changes to the same structure
by multiple rules at the same time. This is pretty easy to do, since
we simply track that the entire object has changed (by way of
undo/change notification from the object): the trellis can thus just
detect the conflict and raise an error.
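A minimal sketch of that coarse check, assuming the trellis knows
which rule is currently running (names hypothetical):

    class ConflictError(Exception):
        pass

    _changed_by = {}   # id(structure) -> rule that changed it this recalc

    def note_change(structure, current_rule):
        # Called from a structure's change/undo notification hook.
        prior = _changed_by.setdefault(id(structure), current_rule)
        if prior is not current_rule:
            raise ConflictError(
                "two rules changed %r in one recalculation" % (structure,))

    def end_recalc():
        _changed_by.clear()   # reset the bookkeeping between recalcs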
But this is undesirable for data structures that *want* to be shared,
and do not expect any conflicts under normal circumstances. For
example, suppose you are displaying a grid of data about various
records. You'd like to assemble a summary of changed fields from the
visible records, so that you can update the display. In the current
trellis implementation, you must loop over *all* fields to generate a
list, dict, set, or other collection of changed fields.
But with the new side-effect capability, you could have individual
rules that write to a single, shared data structure, whenever the
fields change. Since each rule is contributing to only one piece of
the whole, there is no way they can conflict. Thus, you could have a
rule that simply reads an efficiently-generated collection of changes.
To make this work, we either need the trellis to support fine-grained
conflict detection using keys of some kind, *or* we need data
structures to do their own conflict detection. After all, if you
*know* that a conflict can't happen, why incur the extra
overhead? Likewise, if you know that only one rule should be writing
to the data structure, then it makes sense to save the storage and
computational overhead of doing fine-grained tracking.
So, this suggests that the trellis should do coarse locking (only one
changer of a data structure allowed) by default, and provide a means
whereby data structures can do fine-grained locking, or even no locking.
In fact, it's not clear that it's the trellis, per se, that should do
the conflict detection at all. Cell objects can easily do their own,
and so can any other data structure.
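For instance, a shared dictionary that opts into fine-grained checking
might track writers per key, along these lines (a sketch; the trellis
hookup and undo logging are elided):

    class ConflictError(Exception):
        pass

    class SharedDict(dict):
        # Sketch: per-key conflict checking; trellis hookup elided.

        def __init__(self, *args, **kw):
            dict.__init__(self, *args, **kw)
            self._writers = {}   # key -> (rule, value) for this recalc

        def set(self, key, value, rule):
            # 'rule' stands in for however the trellis identifies the
            # currently-running rule.
            if key in self._writers:
                prior_rule, prior_value = self._writers[key]
                if prior_rule is not rule and prior_value != value:
                    # Two rules, two different values: order conflict
                    raise ConflictError("conflicting writes to %r" % (key,))
            self._writers[key] = (rule, value)
            dict.__setitem__(self, key, value)

        def commit(self):
            self._writers.clear()   # called at end of recalculation

Note that two rules writing the *same* value to a key is allowed here,
matching the observation above that equal writes create no order conflict.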
Rollback, Commit, and Change Notices
------------------------------------
It's up to an individual data structure to decide what "changed"
means. For most data structures we have now, change is an atomic
thing. Right now, if you change the value of one key in a
dictionary, the whole thing is considered "changed". So even if you
have a rule that only watches one key, it will be recalculated a lot.
Of course, if you track changes to individual keys, then there are a
lot more dependencies to record, more allocation and garbage
collection, and so on. So fine-grained tracking could also be quite expensive.
It seems the reasonable thing to do here is to let data structures
decide these matters. But *how*?
Cells currently use equality testing on a value to decide what has
"changed". This doesn't work well for mutable data structures,
because copying is required for automatic change detection, or else
you have to trigger changes manually anyway.
So, it seems that there needs to be an abstract sort of cell, that
doesn't do any comparisons, but just records read/change operations,
and can be linked to a recalculation/undo callback.
The tricky bits of this involve issues like managing reference
cycles, and the actual mechanism for handling rollback/commit
operations (which have to be handled by the data structure, but are
invoked by the Trellis).
It seems to me that there are some connections between change
notification, undo, conflict detection, and cycle detection that I'm
not seeing clearly yet. It seems like there should be some simple
and powerful way to do all of these things together, but I don't see
what it is yet.
For example, change==undo. If you change something, you have to be
able to undo it. If you have to undo something, then something has
changed. However, in the model described thus far, the missing
ingredient is saying *what* has changed.
It's not needed for undo, since one can simply roll back the entire
undo log to a desired point in the recalculation. But in order to
notify rules that things have *changed*, we have to know "what"
changed, in order to know if it's the same "what" that some rule
previously read.
It seems as though we could have some sort of "node" objects with an
interface like:

    node.used()                     -- record that this node was read
    node.changed(undo_func, *args)  -- notify about changes, record an undo action
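A bare-bones version of that interface might look like this (the
current-rule and undo-log globals, and the schedule() hook, are
placeholders for whatever the trellis actually uses):

    import weakref

    _current_rule = None   # set by the trellis while a rule runs
    _undo_log = []         # the recalculation's shared undo log

    def schedule(rule):
        """Placeholder: queue 'rule' on the priority heap (see below)."""

    class Node(object):
        def __init__(self):
            self.readers = weakref.WeakSet()   # rules that read us

        def used(self):
            # Record that the currently-running rule read this node.
            if _current_rule is not None:
                self.readers.add(_current_rule)

        def changed(self, undo_func, *args):
            # Log how to undo the change, then dirty our readers.
            _undo_log.append((undo_func, args))
            for rule in self.readers:
                schedule(rule)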
And if these changeable elements represent the nodes of our
dependency graph, then we also need *edges*: the rules.
(Well, really, rules aren't graph edges per se. Instead, the
dependencies are edges, and groups of edges are effectively labelled by rules.)
Anyway, if we track dependencies forward and backward, so that each
node knows who it depends on -- and what rule was used for that
dependency -- as well as who depends on it (via weak references),
then it should be possible to do the whole thing without rules
needing to be special objects in any way. It should be sufficient to
invoke a rule via the trellis *once*, and it will live as long as any
nodes exist that were changed by it.
However, if we take this approach to rules, there is a bit of a
caveat. A rule that produces *no* output (doesn't cause any changes
to fire) should remain a dependency of its former
targets. Otherwise, if a rule doesn't produce a changed value, it
will cease to be recalculated. However, a rule that uses no *inputs*
on a given run (no used() calls) can be safely discarded.
The Algorithm
-------------
Nodes have a layer number, initially zero. Nodes changed outside a
rule have their layer reset to zero. Changed nodes have their layer
number elevated to be at least 1 greater than the highest layer of
any node read (other than themselves) by a rule that changes them.
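In code, the layer bump might amount to something like this (a sketch;
note that later in this post, layer numbers end up belonging to rules
rather than nodes, but the arithmetic is the same):

    def bump_layer(changed_node, nodes_read):
        # The changed node's layer must exceed that of every node the
        # changing rule read (other than the changed node itself).
        highest = max((n.layer for n in nodes_read
                       if n is not changed_node), default=-1)
        changed_node.layer = max(changed_node.layer, highest + 1)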
Nodes remember (strongly) the input nodes that they depend on, and
via what rule. They also weakly reference the output nodes that
depend on them. When they change(), the nodes that depend on them
are queued for updating, once their new layer numbers are known. A
node never depends directly upon itself.
While a rule is running, the used() nodes are tracked, so they can be
added to the dependencies of all the changed() nodes the rule touches.
During recalculation, a priority heap of nodes requiring update is
kept. Whenever a node is marked dirty by a change to one of its
inputs after it has already been validated during the same
recalculation, the undo log must be rolled back to the point where it
was previously calculated, after updating the
node's layer number. (Updating the layer number changes the order in
which the new attempt at recalculation will proceed, so that the same
thing doesn't happen again.)
It's possible that a more efficient rollback could happen, in that
only the nodes which depended on the dirty node really *need* to be
rolled back. But it may be a lot simpler, code-wise, to just keep a
giant linear undo log and roll it back the same way. (Especially
since the undo log might be used to track changes to nodes' dependencies.)
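So a simple linear log with savepoints may be all that's needed;
something like:

    class UndoLog(object):
        def __init__(self):
            self._log = []   # (undo_func, args) entries, oldest first

        def record(self, undo_func, *args):
            # e.g. before a cell write:
            #     log.record(setattr, cell, '_value', old_value)
            self._log.append((undo_func, args))

        def savepoint(self):
            return len(self._log)   # a position to roll back to later

        def rollback_to(self, point):
            # Undo entries newest-first, back to the savepoint.
            while len(self._log) > point:
                undo_func, args = self._log.pop()
                undo_func(*args)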
When a node is marked dirty, the trellis will note what node marked
it so, and add it to the heap using its (possibly increased) layer
number as a priority. When the node is reached in the heap, its
dependencies are scanned to identify all of the rules that read any
of the dependencies that marked the node dirty. These rules are then
re-run, with the new dependency links replacing the old.
Well, that's not quite right. The catch is that we want to re-run a
rule only once. And because of that, it's not the nodes that have
layer numbers -- it's the rules.
Why? Because it's the rules that need ordering, not the
nodes. You're going to re-run a rule as soon as you reach the lowest
layer of any of its outputs, so it's a waste to keep track of layer
numbers for the nodes.
So, if we treat any outside-the-trellis changes as being done by a
rule at layer 0, things make a bit more sense. Any rule that depends
on a value changed by that layer 0 rule then has its layer bumped to
at least 1, and so on.
So in terms of the graph structure, we wrap rules in objects that
keep the layer info, and nodes can weakly reference the rule-objects
that read them, and strongly reference the ones that write
them. Rules in turn weakly reference their output nodes, and
strongly reference their input nodes. (This is somewhat unfortunate
in that it seems to more than double the memory requirements of the
current trellis! At least, unless I come up with a more clever way
to store the links.)
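In sketch form, with WeakSet standing in for whatever link storage
turns out to be most compact:

    import weakref

    class RuleInfo(object):
        # Wraps a rule function with its layer and dependency links.
        def __init__(self, func, layer=0):
            self.func = func
            self.layer = layer
            self.inputs = set()               # strong: nodes we read
            self.outputs = weakref.WeakSet()  # weak: nodes we write

    class Node(object):
        def __init__(self):
            self.writers = set()              # strong: rules that write us
            self.readers = weakref.WeakSet()  # weak: rules that read us

The strong reference from a node to its writers is what keeps a rule
alive exactly as long as some node it changed is alive, as described above.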
Observers are simply rules that live at "layer infinity". When the
priority heap reaches this layer, it continues running rules and
updating dependency links, but changes are no longer allowed, and
calling any node.changed() should trigger an error and a complete
rollback of the recalculation. In fact, any unexpected errors during
recalculation should result in rollback.
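The driving loop is then roughly a priority heap keyed by layer, with
observers pinned at infinity (a sketch; run_rule() is a placeholder):

    import heapq

    OBSERVER_LAYER = float('inf')

    def run_rule(rule, changes_allowed):
        """Placeholder: re-run 'rule', tracking used()/changed() calls;
        changed() raises if changes_allowed is False."""

    def recalculate(heap, undo_log):
        # heap entries: (layer, seq, rule); seq breaks layer ties
        while heap:
            layer, seq, rule = heapq.heappop(heap)
            try:
                run_rule(rule, changes_allowed=(layer < OBSERVER_LAYER))
            except Exception:
                undo_log.rollback_to(0)   # any error: full rollback
                raise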
Cycle detection occurs when an already-calculated rule is
recalculated. Whenever a rule is marked for calculation, we keep a
record of what rule(s) caused it to be recalculated, and we can then
use that record to backtrack and check for a cycle. If there is one,
we roll the whole thing back and raise an error -- *and* we can
include all the involved rules in the error, for debugging!
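Backtracking over those records is a simple walk; roughly (with
'caused_by' as a hypothetical map from each rule to the rule whose
change scheduled it):

    class CycleError(Exception):
        pass

    def check_for_cycle(rule, caused_by):
        # Walk back through the chain of causes; if we reach 'rule'
        # again, it's a cycle, and 'path' holds every involved rule.
        path = [rule]
        current = caused_by.get(rule)
        while current is not None and current not in path:
            path.append(current)
            current = caused_by.get(current)
        if current is rule:
            raise CycleError("dependency cycle among rules: %r" % (path,))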
To implement tasks, we also need the ability to post callbacks to be
run after the observers are finished, so that tasks effectively run
"outside" the trellis. (We don't want to use EventLoop.call() for
this, though, because tasks need to be able to work without an active
event loop.)
Having this post-commit callback ability also allows us to handle
discrete events/receiver cells: we can register a @modifier that
resets them all to their default values!
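A sketch of that hook (names hypothetical):

    _post_commit = []   # callbacks to run once recalculation commits

    def on_commit(func, *args):
        _post_commit.append((func, args))

    def run_commit_hooks():
        # Runs after the observers finish; callbacks execute *outside*
        # the trellis, so they may freely start new recalculations.
        # (This is also how resumed @tasks would get their turn.)
        while _post_commit:
            func, args = _post_commit.pop(0)
            func(*args)

    # e.g., a receiver cell schedules its own reset when written:
    #     on_commit(receiver.reset_to_default)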
In this way, most of the built-in features of the current trellis
actually become pluggable, in the sense that they are all implemented
using a lower-level API. So Cells in the current model would be
implemented by combining a "node" with a place to store a value, and
some logic to decide when to mark the node .changed() or
.used(). For cells with rules, the rule execution would just get
wrapped by an API call, so that the rule winds up being tracked. Of
course, that rule would really be a method of the cell that wraps
the "real" rule with extra logic to set the cell's value to the
return value, and mark it changed() if appropriate.
Read operations on cells are simple: just read the current value and
mark the node .used(), unless the cell is new and has a rule (in
which case you run the rule at layer 0 and recalc first). Write
operations are just a @modifier wrapping a write, a conflict check,
and a .changed() call.
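Putting the pieces together, a value cell might be little more than
this (a sketch building on the Node above; the conflict check is
elided):

    class Cell(object):
        def __init__(self, value=None):
            self.node = Node()   # the Node sketched earlier
            self._value = value

        def get(self):
            self.node.used()     # record the read
            return self._value

        def set(self, value):
            # Conceptually wrapped in a @modifier; the conflict check
            # is elided here.
            if value != self._value:
                old = self._value
                self._value = value
                # Tell the node how to undo this write:
                self.node.changed(setattr, self, '_value', old)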
Ooh, one last thing: data structures with conflict detection need to
be able to register commit-time callbacks for cleanup purposes. But
this could possibly be handled by giving nodes a slot where conflict
data gets stored, and having all changed() nodes clear the slot at
the end of recalculation.
Wrap-Up
-------
I think this covers most everything that's needed to implement the
new approach, and throw open the gates to having a near-infinite
variety of trellis-enabled data structures. Designing something like
this is hard, because there are a huge number of minor variations and
special cases, even though the nominal conditions are quite simple.
Before actually going to implementation, though, I'm going to want to
vet this model against a variety of use cases and requirements,
especially ones reverse-engineered from the existing tests and
codebase. Then, I'll write a more detailed specification that's not
so stream-of-consciousness, and try to nail down what the new APIs
will look like. One nice side-effect of this design is that a lot of
features like 'ensure_recalculation()', 'dirty()', 'poll()' etc.
should be implementable using the new system's primitives, rather
than as filthy hacks of the system's innards.