[PEAK] Flashboard and the ODD update protocol

Phillip J. Eby pje at telecommunity.com
Sun Dec 4 17:51:15 EST 2005


Just a few quick notes on a concept for system monitoring and other kinds 
of co-ordinated status blackboards, that I call "flashboard".  In essence, 
the concept is a purely-distributed, fully-replicated transient database 
managed by a group of processes, communicating via Spread.  The data is 
managed by consensus using a simple protocol I call ODD, which stands for 
Offer-Dump-Delta, which are the three types of messages used.  Flashboard 
isn't currently implemented, so this is just a dump of my thoughts on how 
it could be implemented in a fairly generic way for a wide variety of 
specific data structures.  (As with so many other planned PEAK features, 
it'll be implemented when it's needed for a paying project.)

The processes participating in a Flashboard may be producers or consumers 
of data, or some combination thereof.  For system monitoring tasks, most 
will be the processes being monitored, and they will issue deltas for any 
changes in their status, as well as possibly participating in the 
Offer/Dump rituals that are used to catch up any newcomers monitoring the 
group.  Other participants in a system monitoring Flashboard might include 
statistical analyzers, simple monitoring clients, loggers, special-purpose 
analysis tools, etc.

If you think of a flashboard as being like a scoreboard, then the deltas 
represent changes to the scoreboard's content.  Each process also has its 
own view of what the scoreboard currently looks like - i.e., its 
state.  Deltas are sent to the group, and each process updates its 
scoreboard only from *received* deltas.  That is, processes do not apply 
their deltas directly; they send them to the group first, using Spread's 
"agreed ordering" messages so that each process ends up applying the deltas 
to their private scoreboard in the same order.

In principle, this is all you need to keep every process in sync with the 
same scoreboard contents.  In practice, members will come and go from the 
group, and they need to have a way to catch up with the current 
consensus.  This is where the Offer/Dump ritual comes in.

Whenever a process sees that one or more processes have just joined the 
group, it issues an Offer message describing how good its own history 
is.  In the simplest setup, this "goodness" can initially just be the 
timestamp of when the offering process joined the group.  The process that 
has been around longest will have seen the most.

When a process sees its *own* Offer message reflected back from the group, 
it immediately issues a Dump message containing its version of the state -- 
unless it has already seen an Offer message describing a state that's 
equivalent to or  better than its own.  This ensures that at most one Dump 
will be sent per distinct state, which will rapidly converge to a single 
"best" state as processes load any Dump that is better than their own state.

There is a tricky bit to this ritual, however.  A dump reflects some 
particular point in time, and processes may be issuing Deltas all the 
time.  If you load a Dump, you need to be able to tell which Deltas have or 
have not been incorporated into it.  That's one of the reasons for the 
Offer message.  Since Deltas are only processed upon receipt, and a Dump is 
issued in response to one's own Offer, and the order of all messages is 
agreed on by all members, it is easy to demonstrate that any Deltas 
received after the Offer are *not* incorporated into the Dump.

Thus, when a process receives an Offer it likes (i.e., for a state better 
than its own), it should begin queuing received Deltas rather than applying 
them, until it receives the corresponding Dump.  At that point, it can 
apply the queued Deltas after loading the Dump.  A timeout or maximum queue 
size is required to deal with the possibility that the offer's original 
sender died after sending its Offer but before sending a Dump.

All processes, regardless of whether they are currently waiting for a Dump, 
should recognize when an Offer is "better" than the one they are waiting 
for or already have, and switch to queueing Deltas while awaiting a 
corresponding Dump.  Likewise, if a process sees an Offer "worse" than its 
state, and it doesn't already have an Offer of its own outstanding, it 
should go ahead and issue an Offer of its own.  (This allows a process that 
failed to receive a Dump (because its offerer died between Offer and Dump) 
to "retry" by issuing an Offer of its own.)

Described like that, the overall protocol is quite simple, but reducing it 
to a precise state machine will be slightly more complex.  Luckily, the 
basic concept can be applied to many kinds of delta and state, and the 
protocol only needs to be able to tell if a given offer is "better, worse 
or equal" to another offer or to the current state.  It needs to be able to 
"apply" a delta, and "dump" or "load" a state.  The actual *content* of 
deltas, dumps, and offers can be entirely parameterized, so the basic 
protocol can be reused for a variety of application designs.

There are a couple of interesting special cases of the protocol, where you 
are only interested in Deltas or only Dumps.  For example, if you consider 
all states to be equally good, then Dumps will never need to be processed, 
and you simply apply deltas (e.g. a logging tool).  Another example is a 
tool that participates in two groups: in one group it processes deltas to 
build a higher-level summary of the activity as a state which it then 
shares in a different group.  However, the state is only changed by deltas 
from the low-level group, and the high-level group has no deltas at 
all.  Thus, a process can simply plug in to the high-level group to receive 
a current snapshot of the high-level summary as a read-only (no dumping) 
participant.

In practice, though, I suspect such a summarization process would be 
tricky, because it implies a dedicated process and a non-replicated state 
-- which probably means you should just have the summarization process just 
stick it in a database.  If the detail group keeps enough history, you can 
simply have a periodic monitor (e.g. a cronjob) grab the data from the 
group, update the summary database, and issue a delta to reset the 
history.  If a monitoring probe wants current data, it just combines the 
current database state with the shared state.

The principal limitations of Flashboard are in the need for the state to be 
small enough to be reasonably transmitted in a Spread message, and the 
essentially "unreliable" nature of the mechanism.  If there is ever a 
moment in which no processes belong to the group, or in which one or more 
new processes remain which not receive a successful hand-off from the 
previous members, then data is lost.  If a group encounters a net split, 
such that there are two or more subgroups unable to communicate with each 
other, the rejoined group will have different data for the "same" Offer 
description, and thus there may be inconsistent states within the group, 
perhaps indefinitely.  (There are possibly ways to combat this by adding 
more data to the Dump and/or Offer, to allow detecting inconsistent states.)

Thus, the concept is useful mainly for reflecting relatively-current 
states, such as "who's online now and what are they doing", "what queries 
are being processed now", etc.  In such system monitoring and "presence" 
applications, the splitting of the group or the absence of members is 
itself an indication of either useful information, or that the information 
is not required.  Also, the state to be managed in such cases is relatively 
simple.

Flashboard is therefore not a replacement for general-purpose messaging or 
queueing, nor is it a distributed replacement for a database.  It does, 
however, make it possible to embed fairly sophisticated monitoring into 
applications.  And, in the simplest case, the applications can simply issue 
Deltas and nothing else, leaving it to monitoring tools in the groups to 
build higher-level models of the current situation.  Simple web clients can 
also be used to map these higher level models to JSON (JavaScript Object 
Notation) for use by AJAX browser clients, resulting in a relatively 
lightweight way to browse and inspect the current state of running systems.

In addition to system monitoring tasks, some applications can benefit from 
"presence" functions like "who else is looking at this trouble ticket" or 
"what is my team doing right now", and these are also good uses for the 
Flashboard concept.




More information about the PEAK mailing list