[PEAK] Possible performance enhancements for the Trellis
Phillip J. Eby
pje at telecommunity.com
Mon Aug 6 13:46:02 EDT 2007
I've been thinking a bit about ways to further enhance Trellis
calculation performance, and have thought of two things that could
perhaps be added to take advantage of the fact that most Trellis
rules are not recalculated in a given pass, and the fact that most
Trellis rules have a very stable access pattern, in terms of which
cells they access.
The current algorithm schedules cells to be potentially recalculated,
and when a cell's value is requested, it looks "down" the dependency
hierarchy to see whether any of its dependencies have changed.
But, if we do things in a slightly different way, a cell should be
able to tell in most cases whether it's up-to-date without looking at
*any* of its dependencies.
Let's suppose that instead of tracking a "version" (what pulse the
cell is up-to-date as of) and "changed_as_of" (what pulse the cell
last changed), we simply track whether a cell is "dirty" (has had a
dependency change) and what "layer" it is in.
A cell is initially in layer zero, and its layer is reset to zero any
time a value is explicitly written to it. When a cell's rule is
calculated, its layer is increased until it is greater than the
highest layer of any of its dependencies. So, a rule that reads only
layer zero cells will have a layer of one. If it later reads a cell
from layer 2, its layer will become 3. After that, however, reading
more layer zero cells will not drop its layer to 1; it will remain a
layer-3 cell. (The idea here is to minimize the amount of
layer-changing a cell does in its lifetime.)
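In rough Python, the layer bookkeeping might look like this (a
minimal sketch only; the Cell class and its attribute names here are
illustrative, not the actual Trellis internals):

    class Cell:
        def __init__(self, rule=None, value=None):
            self.rule = rule
            self.value = value
            self.layer = 0            # every cell starts in layer zero
            self.dirty = False
            self.dependencies = []

        def set_value(self, value):
            # an explicit write resets the cell to layer zero
            self.layer = 0
            self.value = value

        def recalculate(self):
            self.value = self.rule()
            # raise (but never lower) the layer until it exceeds the
            # highest layer of any dependency
            for dep in self.dependencies:
                if self.layer <= dep.layer:
                    self.layer = dep.layer + 1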
When a cell's value changes, it sets the dirty flag of its observers,
and adds them to a priority heap (instead of the current simple
list), ordered by layer. The cleanup algorithm then runs in priority
order, so that recalculation is done breadth-first through the layers.
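A sketch of the scheduling side, using Python's heapq module to order
recalculation by layer (the counter is just a tie-breaker so that
entries on the same layer compare cleanly; again, the names are
illustrative):

    import heapq, itertools

    recalc_heap = []                  # entries are (layer, count, cell)
    _counter = itertools.count()

    def schedule_observers(cell):
        # a changed cell marks its observers dirty and queues them,
        # lowest layer first
        for obs in cell.observers:
            if not obs.dirty:
                obs.dirty = True
                heapq.heappush(recalc_heap, (obs.layer, next(_counter), obs))

    def cleanup():
        # recalculation proceeds breadth-first through the layers
        while recalc_heap:
            layer, _, cell = heapq.heappop(recalc_heap)
            if cell.dirty:
                cell.recalculate()    # notifies observers if it changed
                cell.dirty = False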
Thus, a cell knows that:
1. It is definitely dirty if its flag is set; it should recalculate
itself, notify its observers if needed, and then clear its dirty flag
2. It is definitely NOT dirty if its flag is not set, and the *first
item in the priority heap is on the same or higher layer* (or the
heap is empty). Since any dependency of a cell is always on a lower
layer than the cell itself, if there are no lower-layer cells on the
queue, then all of this cell's dependencies have already had their
chance to mark this cell dirty -- and didn't. Therefore, the cell is
clean, and doesn't need to make any changes. (The current algorithm
has to set the _version attribute equal to the current pulse to mark
itself clean.)
3. It is *maybe* dirty if its flag is not set, and the first item on
the priority heap is on a lower layer than the current cell's
layer. In this case, we must recursively check the dependencies,
just as in the current algorithm -- except that we only need to
recursively check dependencies that don't fall into one of the two
preceding categories, so there are fewer branches and shallower
tests. In addition, the recursion should very rarely be required,
since the only time this condition can occur is when a cell rule
requests a value that was not previously a dependency. (Because if
it was previously a dependency, the current cell would have a higher
layer number, and one of the other two conditions would have
applied.) A sketch of this three-way check follows.
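Assuming the heap entries from the earlier sketch, the three cases
might combine into something like this (illustrative only; a False
result just means the cell has to take the full recalculation path):

    def is_up_to_date(cell, heap):
        # Case 1: flag set -- definitely dirty
        if cell.dirty:
            return False
        # Case 2: flag clear, and nothing pending on a lower layer --
        # definitely clean, with no bookkeeping needed at all
        if not heap or heap[0][0] >= cell.layer:
            return True
        # Case 3: flag clear, but lower-layer cells are pending --
        # recursively check the dependencies, as the current
        # algorithm does
        return all(is_up_to_date(dep, heap) for dep in cell.dependencies)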
I think this approach will be a strong win for performance in large
Trellises, as the current algorithm *always* does a recursive
dependency scan when a rule reads a value that hasn't yet been
scanned for changes in the current pulse. That means that a lot of
cells get looked at -- almost as many as if you were actually
recalculating everything. This new approach would minimize pointless
scanning, and possibly open the door to a better approach to
"hub" cells that receive data from a large number of child cells (but
which you don't want to have looping over a whole bunch of inputs as
dependencies).
The second way to take advantage of a Trellis's stability is to
exploit the fact that most rules look at the same cells over and
over. But the current algorithm builds a new list of
dependencies (and weak observer links) every time there's a
recalc. As a side effect, this also means that various objects (such
as weakrefs and a "Mailbox" object) get created and discarded for each recalc.
However, if the dependencies are the *same*, then there is no point
in changing anything -- the existing dependencies could just be left
alone. In addition, the code that creates observer links could just
record the dependency links, and the recalc code could set up the
backlinks, deleting ones that are no longer needed, and adding the
ones that are new. So only the bare minimum number of weakrefs would
be created, and the elaborate current system of doubly-indirected
weakrefs could go away -- eliminating the need to have a separate
Mailbox object as a weakref target, and thereby saving both time and space.
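A sketch of that dependency-diffing idea, assuming each cell keeps
the dependency list from its previous recalculation (observer links
shown here as a plain set for simplicity; the real implementation
would presumably still use weakrefs):

    def update_dependencies(cell, new_deps):
        old_deps = cell.dependencies
        if new_deps == old_deps:
            return                        # the common case: no churn at all
        old_set, new_set = set(old_deps), set(new_deps)
        for dep in old_set - new_set:
            dep.observers.discard(cell)   # drop stale backlinks
        for dep in new_set - old_set:
            dep.observers.add(cell)       # create only the new ones
        cell.dependencies = new_deps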
Neither of these algorithm and data structure changes would affect
the public API, so I'm not sure if I'll bother with them for the
0.5b1 release, unless I end up reconsidering the way the Trellis
makes "future" changes.
Specifically, I came up with the idea of the priority heap while
thinking about ways to build a solid mechanism for TrellisDB to keep
track of changes to domain objects and map them back to a
record-oriented model.
I'm beginning to realize that the Trellis' current model for managing
writes in rules (i.e., writing to a cell causes a new time pulse to
occur) is a bit more limited than I'd like it to be. The advantage
of a write-oriented model is that it handles small changes to large
datasets better than the current Trellis does. However, you can get
similar effects by using a good "hub".
Consider a database service that maintains the mapping between domain
objects and their corresponding records. When a domain object
changes, you want the service to know what record changes will
correspond, and in turn the actual record-based services need to
update the query sets that contain the before and after versions of
those records.
The "normal" way to do this is that you'd write code to update the
record sets when a change occurs, but It's Not The Trellis
Way(TM). The Trellis way to do this would be that you have a rule
that loops over the domain objects in memory to create a record-level
diff of all the current changes. But that loop is very inefficient,
and gets more so as the number of objects grows.
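For illustration only, that naive rule might look like this, with
domain_objects and record_diff standing in for whatever TrellisDB
actually provides:

    def pending_record_changes():
        # re-runs in full on every pulse, touching every object in
        # memory whether it changed or not
        diff = {}
        for obj in domain_objects:
            diff.update(record_diff(obj))
        return diff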
Enter the Hub. I've talked about this idea before, and ended up
discarding it in favor of the todo/future approach that's used in the
current SVN version (and isn't yet fully documented in the
tutorial). However, I'd forgotten about the TrellisDB use cases
while I was working on that. (The current stuff is still useful,
just not as much so for TrellisDB.)
So, I'm thinking that a Hub cell will be something like a dictionary
of input cells, with an output cell whose value is a dictionary. You
add cells to the hub, and it registers them as dependencies of the
output cell. But instead of actually looping over those cells, it just
takes the notifications it gets, and uses them to assemble a
dictionary (or set, or list, or whatever) that contains the values of
the changed cells. You could then have rules that use the changes to
do things... like updating databases.
In a way, it's like writing callbacks to set values -- except that
there isn't really any value setting -- it's just a more efficient
way of doing the naive "loop over everything".
And of course, for this to work efficiently, asking for the output
of a hub shouldn't require looping over all its inputs, even under
the hood. Hence the priority heap idea. That way, as
long as you update the hub's layer whenever you add an input cell to
it, the hub will always live at a layer above its inputs, which means
it should never need to *actually* loop over its inputs.
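Here's a rough sketch of how such a Hub might hang together; none of
these names are existing Trellis API:

    class Hub:
        def __init__(self):
            self.layer = 0
            self.keys = {}        # input cell -> key in the output dict
            self.changes = {}     # the output value: just the changed items

        def add_input(self, key, cell):
            self.keys[cell] = key
            cell.observers.add(self)
            # keep the hub strictly above every input, so asking for
            # its output never requires scanning the inputs
            if self.layer <= cell.layer:
                self.layer = cell.layer + 1

        def notify(self, cell):
            # called only for inputs that actually changed this pulse
            self.changes[self.keys[cell]] = cell.value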
In practice, I suspect there may be occasional "devil in the details"
exceptions to that idea. More on this in another email, as I refine
the priority heap algorithm a bit. The main remaining hurdle is that
it's difficult to clear junk from the heap when a layer inversion
occurs -- that is, when a cell asks for a higher-layered value it's
never used before. If cell rules were required to be side-effect
free (or at least able to handle a "Rescheduling" exception), then we
could simply cancel the calculation of a layer-inverting rule, and
reschedule it to the layer above the value it's requesting.
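In sketch form, the recalculation loop could handle such an
inversion like this (Rescheduling is hypothetical; nothing like it
exists in the current Trellis):

    import heapq, itertools

    _counter = itertools.count()      # tie-breaker for equal layers

    class Rescheduling(Exception):
        """Raised when a rule reads a cell on its own layer or above."""
        def __init__(self, needed_layer):
            self.needed_layer = needed_layer

    def run_recalcs(heap):
        while heap:
            layer, _, cell = heapq.heappop(heap)
            try:
                cell.recalculate()
            except Rescheduling as e:
                # cancel the (side-effect free) calculation and retry
                # it on the layer above the value it tried to read
                cell.layer = e.needed_layer + 1
                heapq.heappush(heap, (cell.layer, next(_counter), cell))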
To avoid output rules being re-run, they'd need to be marked as
outputs, and placed at the highest possible layer. Depending upon an
output rule would then be an error, and outputs would once again be
guaranteed to see a fully consistent state.
In order for this to work, it would also be necessary to ban
non-output cells from making modifications to other cells. In other
words, non-output rules would never be able to change anything. A
bit of experimenting with the current SVN version of the Trellis
shows that only two very tricky test examples in Specification.txt
fail this requirement, and they could easily be discarded.
With this being the case, I'm thinking that in 0.5b1 I will go ahead
and implement an even stronger restriction: that *no* rule may make
modifications to *anything*. That way, if at a later point we go
with a heap-based recalc algorithm that can retry/reschedule rules,
it won't be necessary to go back and modify existing code. That is,
we'll already be used to the restrictions.
I suppose I should also allow you to mark rules as outputs in 0.5b1,
even if it doesn't do anything special right now. That way, the
tutorial and examples will all be suitable for the
re-entrant/non-recursive algorithm.
One final hurdle: generator/task rules. Such rules *can't* be
re-entered, so having a Rescheduling exception is a non-starter. That
means tasks would have to always be treated as "output" rules, and
thus can't be used as inputs to other rules. I think this is
probably an okay restriction, but I'll need to make some changes to
the current implementation so that it doesn't actually produce any
output values as a side effect of yielding.
I think that this limitation on tasks will be okay, because I was
already finding that the value-yielding aspect of tasks was going to
be difficult to explain and even more difficult to use
correctly. Restricting tasks to being only useful for their
side-effects (such as socket transmissions, database updates, etc.)
fixes both of those problems.