[PEAK] TrellisDB basics and imperative programming in the trellis
Phillip J. Eby
pje at telecommunity.com
Tue Aug 21 16:22:51 EDT 2007
[This is a post I was working on last Monday before my PC crashed;
it's slightly out of date relative to my current thinking on the new
algorithm, but it does cover some basic points. Throughout, I'm
commenting in brackets where there are changes since last week.]
In previous posts, I've laid out how TrellisDB will likely work for
queries, using various straightforward ways of mapping changes to
sets into changes on subsets, joined sets, etc.
However, there are some other things that TrellisDB will need to do,
like defining quasi-relational schemas (similar to the EIM model I
created for Chandler), and mapping Component changes to relational changes.
When a component is requested from the DB, it needs to be set up in
such a way that its changes are propagated back as EIM-like
diffs. That is, as a specification of what records to add, modify,
or delete. These changes can then propagate through dependent sets,
and be accumulated in backend write logs to ensure that backends get
updated before the next time they are queried, or whenever an
explicit flush is requested. [08/21 - I'm no longer thinking that
EIM-like diffs are necessarily the best way to propagate this
information; individual backends can simply map set membership
operations onto whatever mechanism is appropriate for the backend.]
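For concreteness, here's a minimal sketch of what the EIM-like diff idea
above amounts to; the function and the tuple layout are made up purely
for illustration, and aren't anything from TrellisDB or EIM:

    def component_diff(old_records, new_records):
        """Turn two {key: fields} snapshots into add/modify/delete ops."""
        changes = []
        for key, fields in new_records.items():
            if key not in old_records:
                changes.append(('add', key, fields))
            elif old_records[key] != fields:
                changes.append(('modify', key, fields))
        for key in old_records:
            if key not in new_records:
                changes.append(('delete', key, None))
        return changes

    # A batch like [('modify', 42, {'email': '...'}), ('delete', 7, None)]
    # would then flow through dependent sets and per-backend write logs.
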
In order for these changes to be processed in "real time" (i.e. a
single recalc sweep), there needs to be a way to combine changes from
multiple sources -- the classic "hub" issue.
One of the goals of the priority-heap recalculation algorithm is to
allow hubs to receive information that is "pushed" to them in some
way, without needing to explicitly iterate all possible inputs.
A side effect of this approach is that it's necessary to roll back
part of the calculation if a "layer inversion" occurs -- if a rule
reads a higher-layered cell than itself. However, if we allow such
rollbacks, then it seems like it should be possible to do the
reverse: roll back modifications when a rule *writes* to a
*lower*-layered cell than itself!
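To make the two cases concrete, here's a tiny hypothetical sketch of the
checks involved; the Cell class and function names are invented for
illustration and aren't from peak.events.trellis:

    class Cell(object):
        def __init__(self, layer=0, calculated=False):
            self.layer = layer
            self.calculated = calculated

    class LayerInversion(Exception):
        """A rule touched a cell on the 'wrong' side of its own layer."""

    def check_read(rule_cell, read_cell):
        # Reading a higher-layered cell that hasn't been calculated yet
        # means the reader has to wait for (or be rescheduled after) it.
        if read_cell.layer > rule_cell.layer and not read_cell.calculated:
            raise LayerInversion("read of an uncalculated higher-layer cell")

    def check_write(rule_cell, target_cell):
        # Writing to a lower-layered cell that has already been calculated
        # would invalidate finished work, so it forces a rollback.
        if target_cell.layer < rule_cell.layer and target_cell.calculated:
            raise LayerInversion("write to a calculated lower-layer cell")
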
That is, we could perhaps make it possible for trellis rules to write
to cells without needing to be in @action rules, after all. This
would simplify a lot of things, but make other things more difficult.
The simplifying part is that you could write everything in a
near-imperative style, as long as you can handle rollbacks. (You
still won't be able to "read the future", though). The difficult
part is that every data structure type will need to handle
rollbacks! This could be simplified a bit through appropriate undo
(or maybe redo) support, though.
The other difficult part is circularity. For reads, circularity
bottoms out by reading the previous value of a cell. For writes, the
naive approach of simply bumping your target's layer and rolling back
to where it was read, would lead to an infinite loop in the case of
circular writes. There'd need to be some way to detect the cycle and
halt it, like counting the number of up-leveling writes and bailing
out at some arbitrary limit. (Probably the simplest way to do this
is that whenever you raise a cell's layer number by writing to it,
you also renumber all the cells that are observing it,
recursively. If in this process you find yourself back at the
writing cell, you know you have a circular writer.)
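Roughly, that renumbering-based check could look like this (hypothetical
names and a deliberately bare-bones Cell, not the actual implementation):

    class Cell(object):
        def __init__(self, layer=0):
            self.layer = layer
            self.observers = []     # cells whose rules read this cell

    class CircularWrite(Exception):
        pass

    def raise_layer(cell, new_layer, writer, _seen=None):
        """Raise 'cell' to at least 'new_layer', renumbering observers
        recursively."""
        if _seen is None:
            _seen = set()
        if cell is writer:
            # We've come back around to the cell doing the writing: a cycle.
            raise CircularWrite("circular write detected")
        if cell in _seen or cell.layer >= new_layer:
            return
        _seen.add(cell)
        cell.layer = new_layer
        for observer in cell.observers:
            raise_layer(observer, new_layer + 1, writer, _seen)

    # After 'writer' writes to 'target':
    #     raise_layer(target, writer.layer + 1, writer)
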
The last difficult part is write-conflict management. To support a
truly imperative style, we would need to handle changes by multiple
rules, but we want to maintain order independence. A simple way to
keep order independence would be to logically "lock" a written cell
so that it's "owned" by the rule that writes it; writing to that cell
from anywhere else would then be an error.
But that approach disallows hubs -- the main thing that I'm trying to
support! If you have a shared data structure like a set or
dictionary, it should be possible for multiple rules to add or remove
items, provided that each one sees the same effective "start" state,
and that none of them change anything the others have read, or vice versa.
Supporting this is certainly *possible*, but it has the downside of
making custom data structures even harder to write. You need to be
able to define what a serializable history is for a given data
type. More precisely, you must be able to tell whether a particular
collection of read/write operations is order-independent with respect
to any other collection of read/write operations on that data
type. For example, if you only add items to a set, that is order
independent with respect to adding other items or even the same
items. Removing an item, however, is only order independent relative
to removing *different* items from the set.
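As a toy example of the sort of test a data type would have to supply,
something along these lines could decide whether two rules' batches of
set operations commute (the function and the op encoding are invented
here, following the add/remove rules just described):

    def set_ops_commute(ops_a, ops_b):
        """Each argument is a list of ('add', item) or ('remove', item)
        pairs."""
        for op_a, item_a in ops_a:
            for op_b, item_b in ops_b:
                if op_a == 'add' and op_b == 'add':
                    continue      # adds always commute, even of the same item
                if item_a == item_b:
                    return False  # add/remove or remove/remove of one item
        return True

    assert set_ops_commute([('add', 1)], [('add', 1)])
    assert set_ops_commute([('remove', 1)], [('remove', 2)])
    assert not set_ops_commute([('add', 1)], [('remove', 1)])
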
A brute force way of determining this would be to undo and redo every
rule that targets the same cell(s) (so as to complete them in more
than one order), then see if the results are any different. But I
don't think that's really going to work very well in
practice. :) It's also very inefficient, since in a
properly-written Trellis, the code should be correct. Ideally, we
want the lowest possible overhead for correct code.
It seems that the best tradeoff here is to simply require that (in
the general case) only one rule is allowed to write to a given data
structure during a single recalculation. This restriction would be
loosened for an explicitly-defined "hub" cell type, which would be
allowed to accumulate input from multiple "spoke" cells. The output
of a hub cell would simply be a list of the values provided by any
changed "spoke" rules. You could then write other rules that would
use the hub output to perform some serialized application of the data
that was thus accumulated -- including changes to other data
structures, as long as that rule is the only one that writes to them.
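A bare-bones sketch of the hub idea (none of these names are real
Trellis API, and the pulse bookkeeping is hand-waved):

    class Hub(object):
        """Accumulates values pushed by 'spoke' rules during one pulse."""

        def __init__(self):
            self._incoming = []     # values pushed by spokes this pulse
            self.output = []        # what downstream rules see for the pulse

        def push(self, value):      # called from any number of spoke rules
            self._incoming.append(value)

        def finish_pulse(self):     # called by the (hypothetical) machinery
            self.output, self._incoming = self._incoming, []

    # A single downstream rule can then apply hub.output serially -- e.g.
    # adding each accumulated item to a shared set -- provided it's the
    # only rule that writes to that set.
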
To borrow an idea from concurrent programming, the only sane way to
write a program that contains non-determinism is to start with a
deterministic system and add non-determinism as necessary. Hubs
allow us to introduce a tiny bit of explicit non-determinism in a
controlled way. Of course, they're not *really* non-deterministic
within a given program run; it's just that if you add or remove
rules, the order of results could change. We could, however, add
some sort of "testing" switch you could use to cause hub outputs to
get randomly shuffled, in case you want to prove your code is truly
order-independent. :)
So, what would be the big differences between this new idea and the
0.5b1 algorithm?
1. Rules can be interrupted by a LayerChange exception if they read
from a cell in a higher layer (and that cell hasn't been calculated
yet), or write to a cell in a lower layer (and that cell has already
been calculated). A LayerChange exception during reading simply
reschedules the rule to wait until after the dependency has been
calculated, but one during writing forces a rollback of all
calculations (and writes!) back to the point just before the target
cell was first read. [08/21 - note that later in this post, we get
rid of the need for the exception; it suffices to re-run the rule
after an undo.]
2. @action would stay around, but you wouldn't need it just for a
rule that did some writing. In fact, @action rules would NOT be able
to do writes, as that would be inherently circular. [08/21 - but we
could possibly loosen this to allow such writes to trigger a new
recalculation cycle, as mentioned toward the end of this post]
3. The current cell write conflict rules would remain in effect, but
there would be an additional way to conflict: two different rules
writing to the same "future" would be an error. [08/21 - some
details still need to be hammered out to implement this latter
aspect, as it requires some sort of additional API]
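For point #3, the extra check could be as simple as remembering who
wrote what during a pulse; the sketch below is purely illustrative,
since the actual API is exactly the part that isn't settled:

    class WriteConflict(Exception):
        pass

    class PulseWriteTracker(object):
        def __init__(self):
            self._writers = {}    # target cell/future -> rule that wrote it

        def record_write(self, target, rule):
            earlier = self._writers.setdefault(target, rule)
            if earlier is not rule:
                raise WriteConflict(
                    "two rules wrote the same target this pulse")

        def new_pulse(self):
            self._writers.clear()
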
I really wish there were an easy way to avoid point #1 -- especially
for writes, which will require some expensive machinery to roll
back. (Bailing out of a read is trivial by comparison.) However, as
best I can tell, avoiding the bailout/rollback requires that you know
in advance which rules are going to read or write what. In the
"normal" cases, these errors won't happen because the shape of the
trellis will quickly stabilize to reflect the true relative
priorities of the cells involved, as long as the program as a whole is acyclic.
I suppose that the plus side is that with these conditions, *any*
error can be rolled back, all the way to the point of its origin,
preserving the trellis' data integrity 100%. Data structure code
only has to guarantee reversibility; the Trellis will take care of
calling all of the rollback functions if an untrapped error occurs in
a @modifier.
Recalculations will have to track all the cells recalculated along
with the undo log, so that the undo log can be unwound all the way to
the point where an "early read" (reading of a cell that would be
written to later) occurred.
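Concretely, that bookkeeping might amount to something like the
following (hypothetical sketch; a real undo log would presumably record
more than bare callbacks):

    class UndoLog(object):
        """Undo entries plus a record of where each cell was first read."""

        def __init__(self):
            self._entries = []      # undo callbacks, in the order recorded
            self._first_read = {}   # cell -> log length at its first read

        def note_read(self, cell):
            self._first_read.setdefault(cell, len(self._entries))

        def record(self, undo_callback):
            self._entries.append(undo_callback)

        def unwind_to_first_read(self, cell):
            """Undo everything done since 'cell' was first read."""
            cut = self._first_read.get(cell, 0)
            while len(self._entries) > cut:
                self._entries.pop()()
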
One interesting side-effect of this algorithm is that it gets rid of
some special-casing that occurs in the current
implementation. Specifically, the cell-initialization bugs that bit
Grant more than once. In the current implementation, we special case
the first time a cell is set, so that the value appears in that cell
immediately. There are various tricky aspects to doing this right,
including how to handle multiple writes before the cell is read.
However, in the new algorithm, writing a cell *always* has immediate
effect, and it's *always* safe to write multiple values to a cell, as
long as no one else has read it and the writes all take place in the
same rule. The current implementation makes these things happen only
for new cells; under the new algorithm, they should naturally fall
out of the "normal" way of doing things.
[08/21 - in discussion last week with Grant, it works out that we
also need for writes to a cell to always take precedence over the
rule's calculation, if both a write and recalc happen in the same
pulse. The rule must still *run* in order to update dependencies,
but its new value must be thrown out, and it must not have any side
effects. That is, the rule mustn't cause anything to be written to
the cell's undo log.]
To verify that, I'll need to review the algorithm in detail,
though. Heck, I need to review it in detail anyway, as the above is
more of a sketch of an idea than it is an actual algorithm.
The new idea is essentially that writes take place immediately, and
reads from a rule/cell that's doing the writing don't conflict. You
simply can't write to a cell if it has already been read in the
current pulse by some *other* cell. If that happens, we force a
rollback to the point where it was read, after elevating its layer to
be higher than the writer's layer. We also check to make sure that
the target cell hasn't been read (directly or indirectly) by its
writer, since that's a circular dependency.
Aha! Better idea: instead of aborting rules with an error when
something happens that we don't like, we just let them return. The
Trellis can sort out the errors after the function returns. We just
track which cells the rule reads and writes, along with the undo log
for that rule. Instead of rolling back changes when a target has
already been read, we simply allow the target to *re*-notify its
observers, which will recalculate them again. We only use the undo
log when recalculating a rule that made changes, so that side-effects
don't happen twice.
Likewise, if we read a value that might be dirty (i.e., it's on a
higher layer), that's okay because we'll get recalculated if it
results in a change.
In other words, the recalculation algorithm is roughly:
* put a cell on the priority heap
* while the heap is non-empty, pop a cell and recalculate it, keeping
an undo log for writes, and tracking what cells it reads and writes.
* update the cell's layer to reflect the cells it has read
* if its value changed, put its observers on the priority heap (with
their layers adjusted higher, if necessary)
* if it wrote to any cells, put those cells' observers on the
priority heap (with their layers adjusted likewise)
Well, that's not quite it. We don't put a cell on the priority heap
if it's flagged dirty, and we flag it dirty when we change one of its
dependencies.
If we read a cell that's flagged dirty, we recalculate it, and apply
the same rules.
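Spelled out as code, the loop might look roughly like the sketch below.
This is only the shape of the idea, under a pile of assumptions: cells
are assumed to expose layer, observers, and a recalc() returning
(changed, cells_read, cells_written); stale heap entries from later
layer bumps are glossed over; and the "recalculate a dirty cell when
it's read" case isn't shown:

    import heapq
    from itertools import count

    def recalculate(start_cell):
        heap, dirty, order = [], set(), count()

        def schedule(cell):
            if cell not in dirty:           # skip cells already flagged dirty
                dirty.add(cell)
                heapq.heappush(heap, (cell.layer, next(order), cell))

        schedule(start_cell)
        while heap:
            _, _, cell = heapq.heappop(heap)
            dirty.discard(cell)

            # Recalculate, keeping an undo log for its writes and tracking
            # which cells the rule read and wrote (assumed interface).
            changed, reads, writes = cell.recalc()

            # The cell's layer must end up above everything it read.
            cell.layer = max([cell.layer] + [r.layer + 1 for r in reads])

            # A changed value means its observers need recalculating, on a
            # higher layer than the cell itself.
            if changed:
                for obs in cell.observers:
                    obs.layer = max(obs.layer, cell.layer + 1)
                    schedule(obs)

            # Writes to other cells likewise dirty those cells' observers.
            for target in writes:
                for obs in target.observers:
                    obs.layer = max(obs.layer, target.layer + 1)
                    schedule(obs)
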
A key difference from the current algorithm is that we can re-run the
same rule more than once, if necessary. @action rules still run only
once, though, as they'll be considered "layer infinity" (much the
same as they are now). We only re-run rules or undo their actions if
backtracking made a difference to the cells that those rules read.
To prevent infinite regress, we must disallow a cell from reading a
cell that it wrote, or writing a cell that it read, however
indirectly. Either that, or we need an arbitrary repetition limit
that counts how many times a rule has been recalculated. The latter
is probably less overhead and easier to implement than a more direct
way of detecting the problem. (We can't use Python's recursion
limit, since the algorithm is mostly non-recursive.)
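The repetition-limit approach is about as simple as it sounds, e.g.
(arbitrary cap and names):

    MAX_RERUNS = 100    # arbitrary; a real limit would be tunable

    class CircularityError(Exception):
        pass

    def note_rerun(run_counts, rule):
        """Count recalculations of 'rule' this sweep; bail past the cap."""
        run_counts[rule] = run_counts.get(rule, 0) + 1
        if run_counts[rule] > MAX_RERUNS:
            raise CircularityError(
                "rule re-run too many times; probable cycle")
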
One really interesting thing about where this algorithm has wound up
is that it's not much different from traditional event-driven
programming, *except* that external side-effects are
partitioned into @action rules, and the rest can be re-run as many
times as necessary to arrive at a consistent result, with
undo-ability for any intermediate results that are invalidated due to
a change in the overall graph structure.
Unfortunately, this is also a pretty major rework of the
algorithm. Fortunately, the algorithm is maybe only 20-25% of the
total peak.events.trellis module. All the high-level APIs could
remain largely untouched. The data structures (List, Set, Dict)
might also need some rework, but probably not much. We could keep
using the "future" mechanism to avoid needing an explicit "undo"
mechanism, since "future" effectively works by writing a value to a cell.
It would suffice to keep track of the value of the cell before it was
written, and if the rule is recalculated we simply restart it with
its previous "before" value. Similarly, when recalculating a rule
more than once, we want to restore that rule's previous "before"
value as well (so that self-referential rules remain consistent).
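A sketch of that "before value" bookkeeping (again, hypothetical names,
not the actual "future" machinery):

    class BeforeValues(object):
        """Remember each cell's value from before its first write in a
        pulse."""

        def __init__(self):
            self._saved = {}

        def note_write(self, cell, old_value):
            self._saved.setdefault(cell, old_value)  # keep only the earliest

        def restart_value(self, cell, current_value):
            # When re-running a rule, start from the pre-write value if any.
            return self._saved.get(cell, current_value)

        def new_pulse(self):
            self._saved.clear()
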
Whew. It's a big and potentially messy change, but we get to have
hubs and (kind of) imperative code. At the moment, it also seems
quite possible that the new algorithm will use less memory *and*
processor time than the current version, although the devil is
definitely in the details. More precisely, it's likely that a lot
*more* memory will be used during recalculation, but possibly quite a
bit less than the static memory requirements of the current
algorithm. And in non-pathological situations, it should do a lot
less dependency analysis.
I'll need to write another post to spell out the behavior
specifications in detail, though. This time I don't want any
under-specified corner cases popping up. So there will be a thorough
re-write of the current Specification.txt to describe the new
algorithm in excruciating detail. That re-write will also need to
cover the correct processing of receivers and discrete rules, as well
as how to handle poll() and repeat() properly in this new paradigm.
For multi-tasking, I had already given @action semantics to @task
methods. This is the right thing to do in the new algorithm simply
because we can't back a generator up a step! However, if we're
losing the ability to set values in @action rules, then we also lose
the ability to set them from generators, which potentially decreases
their usefulness. It may be that we want to keep the current
semantics of doing writes in @action and @task rules, i.e., that this
triggers a new round of calculation after the current one is
finished. That will also relate to how we handle repeat(), since
that has a similar effect.