[TransWarp] PEAK persistence requirements and implementation thoughts
Phillip J. Eby
pje at telecommunity.com
Thu Jun 27 20:34:19 EDT 2002
Hi all. This is just a draft of my thoughts on requirements for the
(non-ZODB) database-backed persistence mechanism(s) to be provided by PEAK
for model.Elements, and how they might be implemented in a refactoring of
TW.Database.DataModel.
CRITICAL: Elements should be accessible via multiple unique-in-context
identifiers (e.g. uid, DN, numeric ID, UUID/GUID, etc.)
CRITICAL: Elements should be able to be based on multiple state objects,
from the same or different underlying databases. (But it's okay if this
requires more effort to configure than elements based on a single state
object.)
CRITICAL: Clients of an Element should not be required to participate in
persisting it. This means that such operations as an explicit .save()
should not be required. However, if such interactions are optional and can
be used to improve performance, it's acceptable to use them. (Many
persistence system designs use a save() call that forces a client to be
aware of when the object *should* be "saved". A better approach would be
an optional beginOp()/endOp() pairing, which keeps an operation nesting
count and automatically saves changes once the object is outside any
"operation". This would allow clients and/or the object itself to manage its
save point collaboratively. That is, a method on an object that altered
multiple attributes could wrap them in a begin/end pair, preventing
excessive writes to the underlying DB; see the small sketch following this
list of requirements.)
CRITICAL: It should be possible to cache Elements (and/or their underlying
states) with per-transaction (read/write) or cross-transaction (read-only)
lifetimes.
CRITICAL: Transaction rollback MUST invalidate the underlying state of all
Elements modified during the transaction. Ideally, this invalidation would
be transparent to any cached references to said state.
CRITICAL: "Intensional state". An Element should be able to refer to its
state in terms of a key. If a cached Element refers to a state object that
was cached across transactions, and the cached state has changed keys, the
cached Element should not re-use the cached state.
IMPORTANT: Ability to have different Elements share underlying database
state(s)
IMPORTANT: Ability to batch write operations to avoid unnecessary network
round-trips, SQL parse and DB locking overhead, etc. But any such batching
must be explicitly flushable, so that queries which expect the underlying
DB to reflect the changes will not see outdated state.
HELPFUL: Ability to automatically promote a cross-transaction cached state
to a per-transaction cached state upon an attempt to write to it. In an
environment which supports invalidation within a transaction, this could
possibly be accomplished by invalidation-on-read of any state which was not
loaded during the current transaction, if a read-write transaction is in
progress.
HELPFUL: It should be possible to preserve the ordering of non-atomic
operations in transmission to the underlying storage mechanism, in order to fulfill
referential integrity constraints. (Note: in many data models this issue
can be avoided by doing all inserts with default null references, and by
preceding deletions with updating to null references. For data models with
non-null reference constraints on databases with non-deferrable constraint
checking, order preservation would be helpful.)
NICE TO HAVE: Ability for Elements sharing an underlying state to be
notified of changes; perhaps even loading/creating the other Element
"views" if they aren't currently in cache.
NON-REQUIREMENT: Threading. As with ZODB, persistent Elements are to be
considered unsafe for sharing among threads. Each thread should get its
own view/copy of a persistent Element or state.
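
To make the beginOp()/endOp() idea above a bit more concrete, here is a
minimal sketch of the client-visible behavior; all names are made up for
illustration and none of this is an existing PEAK API:

class AutoSaving:
    """Saves itself when the outermost endOp() brings the count back to zero."""

    def __init__(self):
        self._nesting = 0
        self._changed = False

    def beginOp(self):
        self._nesting += 1

    def endOp(self):
        self._nesting -= 1
        if self._nesting == 0 and self._changed:
            self._save()

    def _noteChange(self):              # a real Element's setters would call this
        self._changed = True
        if self._nesting == 0:          # client chose not to batch: save at once
            self._save()

    def _save(self):
        self._changed = False           # a real Element would write its state here

A domain method that alters several attributes would simply bracket the
assignments with beginOp()/endOp(), so the object gets written once rather
than once per attribute.
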
The above requirements suggest to me an arrangement roughly like the
following...
"State" objects, similar to TW's old Record objects, maintain a pair of
mappings of field names to immutable atomic values (or tuples
thereof). One mapping represents "persistent" state, i.e., what is
currently in the underlying DB. The other represents "dirty" state, i.e.,
what has been written but not yet sent to the DB.
State objects maintain an operation nesting count, and offer begin/end()
methods, which are called automatically by __setitem__, but can also be
called from an Element that's using the state object in order to batch
changes. When the nesting count goes down to zero, the state object asks
its manager to write its dirty state back to the DB. Also, whenever the
state object goes from non-dirty to dirty or vice versa, it informs its
manager, which will keep a collection of dirty states, so that it can flush
them when a query is attempted against the state manager. (It may be that
this is a shared collection at the DB, rather than per state
manager.) Invalidation should also remove the state from the "dirty"
collection.
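
In rough code, a state object and its manager along these lines might look
like the following sketch (RecordState, StateManager and their method names
are illustrative, not the actual DataModel API):

class RecordState:

    def __init__(self, manager, persistent=None):
        self.manager = manager
        self.persistent = dict(persistent or {})   # what the DB currently holds
        self.dirty = {}                             # written, but not yet sent
        self.nesting = 0

    def __getitem__(self, field):
        if field in self.dirty:
            return self.dirty[field]
        return self.persistent[field]

    def __setitem__(self, field, value):
        self.begin()
        try:
            if not self.dirty:
                self.manager.markDirty(self)        # non-dirty -> dirty transition
            self.dirty[field] = value
        finally:
            self.end()

    def begin(self):
        self.nesting += 1

    def end(self):
        self.nesting -= 1
        if self.nesting == 0 and self.dirty:
            self.manager.writeState(self)           # ask the manager to write us

    def invalidate(self):
        self.persistent.clear()
        self.dirty.clear()
        self.manager.unmarkDirty(self)              # drop out of the dirty set too


class StateManager:

    def __init__(self):
        self.dirtyStates = []

    def markDirty(self, state):
        self.dirtyStates.append(state)

    def unmarkDirty(self, state):
        if state in self.dirtyStates:
            self.dirtyStates.remove(state)

    def writeState(self, state):
        # a real manager would emit the UPDATE/INSERT statements here
        state.persistent.update(state.dirty)
        state.dirty.clear()
        self.unmarkDirty(state)

    def flush(self):
        # called before running any query, so queries see pending changes
        for state in list(self.dirtyStates):
            self.writeState(state)
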
Elements would maintain their own nesting count, passing through the
begin/end methods to their state(s) only when the count crosses the
zero/non-zero boundary. Domain- or view-level methods which manipulate
multiple features should wrap the manipulation in a begin/end pair, to
batch the operation. Both elements and states should be registered with
the current transaction if their nesting count has been touched within the
transaction. Upon abort, the nesting count can be reset to zero (and the
state invalidated), and upon commit a non-zero nesting count should cause
an exception to be raised.
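
The pass-through behavior might be sketched like this (again with invented
names; how registration with the current transaction actually happens is
glossed over):

class ElementNestingMixin:

    def __init__(self):
        self.nesting = 0
        self.states = []                # state objects currently in use

    def beginOp(self):
        if self.nesting == 0:
            # crossing the zero boundary: pass 'begin' through to each state;
            # registration with the current transaction would also happen here
            for state in self.states:
                state.begin()
        self.nesting += 1

    def endOp(self):
        self.nesting -= 1
        if self.nesting == 0:
            for state in self.states:
                state.end()

    def abortTransaction(self):
        # on abort, reset the count and invalidate any state we touched
        self.nesting = 0
        for state in self.states:
            state.invalidate()

    def commitTransaction(self):
        if self.nesting:
            raise RuntimeError("commit attempted inside an unfinished operation")
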
An interesting side effect of the nesting protocol is that one could define
the domain-level "end" operation on the Element to perform domain
validation when the count reaches zero! This would allow "atomic"
operations to temporarily leave the object in a condition that does not
satisfy certain cross-feature constraints, but at the same time would not
write an unchecked state to the underlying DB!
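
For example, building on the mixin sketched above, a domain-level class
could hook the zero crossing to run its invariants before anything gets
written (checkInvariants is hypothetical):

class ValidatingElement(ElementNestingMixin):

    def endOp(self):
        if self.nesting == 1:               # outermost end: about to reach zero
            self.checkInvariants()          # validate before states are written
        ElementNestingMixin.endOp(self)

    def checkInvariants(self):
        # e.g. "endDate must not precede startDate" and other cross-feature
        # constraints that individual setters can't check in isolation
        pass
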
Elements need to know their states, in order to pass through begin/end
messages. They also need to know how to look up their state objects, using
a primary key value (_p_oid?) or data obtained via their other state
objects. (Obviously, all this know-how has to be implemented at the
dispatch layer, since the domain layer is supposed to be oblivious to this
sort of thing.) When a state object is looked up, the Element must check
its own nesting count, and if positive, it should pass a "begin" message to
the newly-retrieved state. It will also need to subscribe to messages from
the state, regardless of current nesting count.
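
At the dispatch layer, the lazy state lookup and subscription might look
roughly like this sketch, building on the earlier classes; getStateFor() and
addObserver() are assumed methods rather than existing API, and a real
Element would presumably use a binding.Once attribute where this sketch
memoizes by hand:

class DispatchElement(ElementNestingMixin):

    def __init__(self, stateManager, primaryKey):
        ElementNestingMixin.__init__(self)
        self.stateManager = stateManager
        self.primaryKey = primaryKey
        self._mainState = None

    def getMainState(self):
        if self._mainState is None:
            state = self.stateManager.getStateFor(self.primaryKey)
            if self.nesting:                # already inside an operation, so the
                state.begin()               # newly-retrieved state must join it
            state.addObserver(self)         # subscribe regardless of nesting count
            self.states.append(state)
            self._mainState = state
        return self._mainState

    def stateInvalidated(self, state):
        # invalidation message from a state: drop our reference to it, so the
        # next access goes back through the state manager
        if state in self.states:
            self.states.remove(state)
        if state is self._mainState:
            self._mainState = None
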
States must know their referencers, in order to pass them invalidation and
state-change messages. When an invalidation message is sent to a
persistent object, it should drop its reference (presumably a binding.Once
attribute) to the state that sent the message.
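
The referencer bookkeeping on the state side could be as simple as the
following sketch (building on the RecordState class above; the
stateInvalidated() callback is the one sketched on the Element just before):

class ObservableState(RecordState):

    def __init__(self, manager, persistent=None):
        RecordState.__init__(self, manager, persistent)
        self.observers = []                 # Elements currently using this state

    def addObserver(self, element):
        if element not in self.observers:
            self.observers.append(element)

    def invalidate(self):
        RecordState.invalidate(self)
        for element in list(self.observers):
            element.stateInvalidated(self)  # tell each referencer to let go
        self.observers = []
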
Handling multiple keys is a bit of a bother. Specialists need to cache
persistent objects within a transaction, to avoid creating new ones on
every reference attempt. Multiple keys means multiple caches, or one
multi-key cache. The tricky bit is that if the keys change, every cache
that contains the object may need to be updated. This suggests the need to
create a re-indexing subscriber/notification protocol, which could then be
used for caches at both the dispatch and data layers.
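
One possible shape for such a multi-key cache and its re-indexing
notification, with method names invented purely for illustration:

class MultiKeyCache:

    def __init__(self, keyNames):
        # one mapping per key the contained objects can be looked up by,
        # e.g. ('uid', 'dn', 'guid')
        self.indexes = {}
        for name in keyNames:
            self.indexes[name] = {}

    def lookup(self, keyName, keyValue):
        return self.indexes[keyName].get(keyValue)

    def index(self, keyName, keyValue, obj):
        self.indexes[keyName][keyValue] = obj

    def unindex(self, keyName, keyValue):
        if keyValue in self.indexes[keyName]:
            del self.indexes[keyName][keyValue]

    def rekey(self, keyName, oldValue, newValue):
        # the re-indexing notification: published by a cached object when one
        # of its key fields changes (and only at zero nesting count)
        obj = self.indexes[keyName].get(oldValue)
        if obj is not None:
            del self.indexes[keyName][oldValue]
            self.indexes[keyName][newValue] = obj
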
Index changes should be published only at zero nesting count, although in
theory this could introduce a race condition when keys are being
changed. In practice, it simply means that until the larger operation in
which the keys are being changed is complete, the object(s) are to be
referenced by their old keys. Because a state's nesting count can never be
zero when it has referencers with non-zero nesting counts, Elements don't
need to buffer key change messages from their states, as they'll only get
such messages when they themselves are at zero nesting level.
Persistent objects and states would both need to support the indexing
protocol, although their implementations might be different. Both would
basically work by keeping a list of keys, each a list of fields. The
dispatch layer implementation for Elements would also need to know which
state object each field came from. Whenever an event occurred which
modified one of the key fields, the cached Element would request re-caching
by its container.
If invalidation occurs for a state that an Element is using to supply part
of one of its keys, the Element must immediately request to be un-indexed
on that key, thus ensuring that the Element won't stay cached under an
invalid key. Conversely, when an Element loads a state that provides
fields for a key, it should try to index itself on that key, if all the
states that contribute to the key are now present. Presumably, an Element
should just implement methods to "set" and "unset" state objects, which
would be used by its Once attribute loaders, by Specialists, and by the
invalidation operation of state objects.
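
Putting the last two paragraphs together, the "set"/"unset" methods and the
key bookkeeping might be sketched as follows (keyFields and its layout are
an assumption made for the example; the cache is the MultiKeyCache sketched
earlier):

class KeyedElement:

    # maps a key name to the (stateName, fieldName) pairs that make it up, e.g.
    # {'uid': [('userState', 'uid')], 'dn': [('ldapState', 'dn')]}
    keyFields = {}

    def __init__(self, cache):
        self.cache = cache
        self.loadedStates = {}              # state name -> state object

    def setState(self, name, state):
        self.loadedStates[name] = state
        for keyName, fields in self.keyFields.items():
            if self._haveAllFields(fields):     # every contributing state loaded?
                self.cache.index(keyName, self._keyValue(fields), self)

    def unsetState(self, name):
        # used on invalidation: un-index every key this state contributes to,
        # so the Element can't stay cached under a now-invalid key
        for keyName, fields in self.keyFields.items():
            if name in [stateName for stateName, fieldName in fields]:
                if self._haveAllFields(fields):
                    self.cache.unindex(keyName, self._keyValue(fields))
        del self.loadedStates[name]

    def _haveAllFields(self, fields):
        for stateName, fieldName in fields:
            if stateName not in self.loadedStates:
                return False
        return True

    def _keyValue(self, fields):
        return tuple([self.loadedStates[s][f] for s, f in fields])
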
Specialists may have some difficulty retrieving Elements cached by fewer
than all of their keys (e.g., because one or more of the Element's state
objects are not currently loaded). If a cache miss occurs on lookup, the
Specialist will look up the state needed for retrieval by that key, and
then map over to any other keys available from the state object to re-check
the cache. If a hit is found, the Element can be given the state object
for future reference. Otherwise, a new Element instance should be created,
registered with the cache, and given the state object for future
reference. (Note that this assumes that all states for an Element have at
least one key in common! For all our current use cases, this is true, but
the assumption should still be documented.)
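
The lookup path for a Specialist might then read roughly like this
(getStateByKey() and otherKeysFrom() stand in for whatever the real mapping
machinery ends up providing):

class Specialist:

    def __init__(self, cache, stateManager, elementClass, stateName):
        self.cache = cache
        self.stateManager = stateManager
        self.elementClass = elementClass
        self.stateName = stateName          # which state "slot" this fills

    def lookup(self, keyName, keyValue):
        element = self.cache.lookup(keyName, keyValue)
        if element is not None:
            return element                              # cache hit

        # cache miss: load the state needed for retrieval by this key...
        state = self.stateManager.getStateByKey(keyName, keyValue)

        # ...then map over to the other keys this state can supply and
        # re-check the cache under each of them
        for otherName, otherValue in self.otherKeysFrom(state):
            element = self.cache.lookup(otherName, otherValue)
            if element is not None:
                element.setState(self.stateName, state)  # give it the state
                return element

        # genuinely new to the cache: create, register (via setState), return
        element = self.elementClass(self.cache)
        element.setState(self.stateName, state)
        return element

    def otherKeysFrom(self, state):
        # placeholder: derive (keyName, keyValue) pairs from the state's fields
        return []
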
Much of the basis for implementing the indexing and other protocols exists
in TW's DataModel module already, but would need to be
re-written. Unfortunately, this isn't all the refactoring that DataModel
needs. The type machinery needs an overhaul as well, and there are some
potential issues with handling Record non-existence. (But I'll save these
matters for another posting.)
The probable location of the base persistence mechanisms in PEAK will be in
'peak.model.database' and 'peak.model.persistent'. Usage in the dispatch
layer would look something like this:
from peak.api import *
# override default model with database-backed model
import peak.model.database as model
import MyDomainLayer
__bases__ = MyDomainLayer,
# ... proceed to specify dispatch-layer mappings to state objects
The difference between 'peak.model.database' and 'peak.model.persistent' is
that the first would be for state-based persistence, using SQL, LDAP or
similar databases, and the latter would be for ZODB-backed direct object
persistence. Using the latter model, the dispatch layer would mainly
consist of specifying where to use BTrees and the like in place of
PersistentList or PersistentMapping. Its mechanisms will likely be quite
different from those of 'peak.model.database'.
Items to be discussed in future postings:
* Type machinery for states, and the necessary interactions with caching,
invalidation, "not found" states, and creation of states
* Index key metadata and cache protocol details
* Transaction responsibilities of Specialists, Elements, States,
StateManagers, etc. Who does what? Who's the _p_jar for each? How can we
get the invalidate-on-write-if-out-of-date machinery to actually work?
* Mechanisms for specifying Element.Field -> State.Field and
Element.Collection -> Specialist.query() mappings and transformations in
the dispatch layer, with the least amount of typing required for common cases.
* Mechanisms for mapping Specialist.query -> StateManager.query and
resulting state objects -> Element.states.