[PEAK] Laziness, validation, and other TrellisDB thoughts
Phillip J. Eby
pje at telecommunity.com
Tue Aug 21 20:47:49 EDT 2007
So, it's long past time for me to get down to brass tacks on the
TrellisDB design. I've had a lot of general thoughts, but a lot of
those thoughts need nailing down or tossing out.
First off, I've been split between a mental model based on old-style
schema APIs (like PEAK's peak.model and the Chandler application.schema
system), and one that simply uses trellis.Component objects to model entities.
From a purely conceptual standpoint, using trellis.Component instead
of a specialized persistence type is an enormous win. You can have a
"pure" domain model component, and manipulate it without even needing
a place to store it, let alone a schema for storing it. You can
manage schema information in a way that's 100% independent (although
you still need a schema, and some way to know what back-end is in use).
From a practical perspective, however, there is a breakdown in two
areas: laziness and validation.
Laziness
--------
Let's tackle laziness first. When you access an object, you may not
need all of its data. For example, if you're displaying a list of
emails, you don't want to have to load the entire text of each one
off of the disk.
Second, and even more critical, is that any attributes that are links
to other objects *must* be lazy, because otherwise accessing any
object will end up loading the entire graph of all related objects into memory!
So, it doesn't suffice to just slap ordinary cells into a component's
__cells__ in order to load a component from a storage back-end. For
multi-valued attributes (such as sets, sequences, or mappings), the
actual data structure can be lazy, so they're not too big a deal.
It's the single-valued attributes, such as a blob field or a
reference to a single related object, that are tricky. The standard
trellis Cell types won't work, as they will automatically calculate
their value after creation. And making the attributes optional
doesn't work, because the component type doesn't know (and doesn't
*want* to know) how to look up the missing field(s) in a database. Not good.
So, we're going to have to have an explicitly "lazy" cell type, or at
least a way to make a cell with deferred initialization. Or perhaps
just an object that has the same interface as a cell, but doesn't
cache a value and instead delegates its operations to something else,
just like the attributes based on lists or dictionaries or sets.
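As a rough illustration of the deferred-initialization flavor of that idea
(not an actual Trellis API; the class and the 'loader' callback are
hypothetical), something along these lines would do:

    class DeferredCell(object):
        """Hypothetical cell-like object whose value isn't computed at
        creation time; 'loader' is any zero-argument callable (say, a
        closure over a row id and a backend) that fetches the real value
        on first use."""

        _missing = object()           # sentinel meaning "not loaded yet"

        def __init__(self, loader):
            self.loader = loader
            self._value = self._missing

        def _get(self):
            if self._value is self._missing:
                self._value = self.loader()   # hit the backend only now
            return self._value

        def _set(self, value):
            self._value = value

        value = property(_get, _set)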
This, by the way, is why my Spike prototype for Chandler internally
represented *all* attributes as sets, and simply had descriptors that
presented single-valued attributes by retrieving the one value from
the underlying set. That way, a backend need only provide virtual
sets in order to implement any attribute.
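A sketch of that trick (not Spike's actual code; '_sets' is an invented
per-object mapping from attribute names to set-like objects):

    class SingleValue(object):
        """Descriptor presenting a one-element (virtual) set as a plain
        attribute, so the backend only ever has to supply sets."""

        def __init__(self, name):
            self.name = name

        def __get__(self, obj, typ=None):
            if obj is None:
                return self
            for v in obj._sets[self.name]:    # return the lone member, if any
                return v
            return None

        def __set__(self, obj, value):
            values = obj._sets[self.name]
            values.clear()
            if value is not None:
                values.add(value)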
So, we could actually do the same thing in TrellisDB, just using a
cell-like object that wraps a set. But it doesn't really matter; the
point is just that a TrellisDB backend is responsible for providing
cells (or cell-like objects) and slapping them all into a Component.
Well, that's not precisely correct. A backend is responsible for
providing cells (or lists, sets, etc.) that it's asked for. A
front-end (middle end?) layer will be responsible for putting it all
into a Component, and for asking backend parts to load their respective data.
Well, load isn't really the right word. The backends will just be
asked for objects that *will* do the loading, when they're required to.
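Roughly speaking, the division of labor might look like this (all the
names here are invented for illustration):

    def assemble(cls, key, backend):
        """Hypothetical front-end helper: ask the backend for cell-like
        objects (or lazy sets/lists/dicts) and slap them into a fresh
        Component.  Nothing here reads from storage; the returned objects
        do the actual loading later, if and when they're used."""
        obj = cls.__new__(cls)                  # skip normal initialization
        obj.__cells__ = dict(
            (name, backend.cell_for(cls, key, name))
            for name in backend.attribute_names(cls))
        return obj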
Validation
----------
In a pure Trellis world (i.e., without TrellisDB), the simplest way
to do validation is to just write a rule that lists all current
errors in an object's state -- all of its constraint violations, in
other words.
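For example, a minimal sketch of such a rule (the exact attribute and
rule decorators have varied across Trellis versions, so treat the
spellings as approximate):

    from peak.events import trellis

    class Contact(trellis.Component):
        name = trellis.attr('')     # attribute spellings approximate
        email = trellis.attr('')

        @trellis.compute            # or @trellis.rule in older versions
        def errors(self):
            """All of this object's current constraint violations."""
            errs = []
            if not self.name:
                errs.append("name is required")
            if '@' not in self.email:
                errs.append("email doesn't look valid")
            return errs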
In a larger system, however, this introduces some problems. First,
some validation rules aren't really object-level, like "every contact
must have an email address unique to that contact". Second, what if
a validation rule refers to a field that we want to load lazily?
One simple fix for the second issue is to just mark the validation
rule(s) @optional, since they are only really needed when editing an
object or trying to save it. However, once you've edited a given
object, its validation rules are going to hang around, and keep
referring to whatever data they used, unless there's some way to
garbage collect that. (Which we'll talk more about later).
For the first issue, we may want to define validation rules in terms
of *queries*, and use query-level notifications to determine whether
a violation has occurred. That is, if a change to the object causes
an addition to a set of "objects with errors", then the change is
responsible and should be rejected.
Interestingly, this same approach could be used for field-level
validation... but probably shouldn't. The big problem with
query-based validation is that it requires changes to be forwarded to
the DB service (even if those changes aren't actually written
out). That makes them impractical for use as live notifications
during editing, and suitable only for rejecting save operations.
Actually, that's not really true. It's not necessary for changes to
be forwarded to the DB service. The system could simply create
virtual sets that bridge virtual records representing an object's
"current changes" (or "state-to-be") with actual DB state.
Or, alternately -- and this makes somewhat more sense -- we can
simply have the DB service keep track of what records are changed,
and modify query results accordingly. This would make a query-based
validation system work splendidly, at the cost of some issues with
cache coherency.
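A crude sketch of that bookkeeping (invented names; a real implementation
would push the filtering into the query machinery rather than scanning
every record):

    class TrackingDBService(object):
        """Hypothetical DB service that overlays unsaved changes on
        query results, so queries see the "state-to-be"."""

        def __init__(self, backend):
            self.backend = backend
            self.pending = {}            # key -> changed (unsaved) record

        def record_change(self, key, record):
            self.pending[key] = record

        def query(self, predicate):
            for key, record in self.backend.all_records():
                record = self.pending.get(key, record)
                if predicate(record):
                    yield record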
To see the cache coherency issue concretely: if you have a query -- say,
a count of tasks in various statuses -- and you modify the status of an
item, then the count has to be updated. Now, we can quite possibly make the
derived sets generate appropriate change events, and thereby cleanly
update the display. But suppose that for some reason we need to
iterate over the query's rows?
If the query is backed by a database, and the database isn't
up-to-date, it is going to produce an inconsistent result. The
counts returned will not reflect the change in status that hasn't
been written back to disk yet.
This strongly suggests that to do validation, one needs a different
"view" of the data, one that is isolated from the rest of the
system. Validation queries should explicitly merge the data being
edited with the current stable state, strictly within the current
editing context.
This implies, of course, that there is such a thing as an editing
context! We need to distinguish between an object that we're just
reading, and one that we intend to edit. This could be done at
"load" time (which isn't really load time) or by asking for editing
later. We would also need to be able to save or cancel the changes,
as well as get a set object that will hold validation data (and
provide validation events).
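In API terms, that might end up looking vaguely like this (an entirely
speculative sketch; none of these names exist yet):

    class EditingContext(object):
        """Hypothetical editing context, distinct from merely reading."""

        def __init__(self, db, target):
            self.db = db
            self.target = target          # the object being edited
            self.errors = set()           # to be fed by validation queries

        def save(self):
            if self.errors:
                raise ValueError("can't save with outstanding violations: %r"
                                 % sorted(self.errors))
            self.db.write_changes(self.target)   # short db-level transaction

        def cancel(self):
            self.db.discard_changes(self.target)

The spelling doesn't matter much; the point is that reading, editing, and
saving are distinguishable operations, and that the editing context owns
the validation data.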
O-O, Persistence, and Context
-----------------------------
So far, some bits of this don't sound very much like an object
persistence system. Personally, I think that's okay, as transparent
persistence is highly overrated. OO mixes code and data, which is a
bad thing if your data is expected to outlive the application.
In the context of long-lived data, OO mostly gives you API
convenience and little else. If you couple your objects to your
external data schema, you are just begging for the headache known as
schema evolution. That's why Chandler has EIM (external information
model) for storing data that's expected to outlive a given version of
the app, or to communicate with other systems (such as Chandler
Server, which is written in Java).
So, the TrellisDB approach is going to require that you explicitly
ask a DB to find you an object (or objects), explicitly begin making
changes to it, and explicitly save those changes. However, for
convenience, it's likely that the Contextual library will be used to
provide automatic tracking and saving, using "with:" blocks in Python
2.5. Contextual already has a notion of scopes, and resources that
are acquired within a given scope being automatically cleaned up or
released afterward.
The 'context.Action' scope in particular is intended for doing
transaction-like things, and should be adaptable to this
purpose. One of the intended functions of context.Action is to
support database connection pooling as well as transactions, so it's
likely that TrellisDB will be wanting to use it anyway.
But that brings an interesting issue into play, which is that an
event-driven GUI application is going to need the action scope to
hang around for perhaps some time. More properly, it needs the
editing scope to hang around, which I suppose isn't quite the same
thing. It also isn't going to play well with my connection pooling
plans for context.Action.
To try and put it more clearly: editing scope is not transaction
scope. Transactions want to be short-lived, but actual human editing
is slow. So editing context has to be long-lived, even if during the
process of validation it's necessary to temporarily grab a DB
connection and do a short-lived read-only transaction in order to
look up or validate something.
This means that TrellisDB backends will need to be able to
temporarily wrap themselves in an Action, if there isn't one active
in the context where they "live". If done correctly, this would also
allow saves to automatically do db-level commits.
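In plain Python terms, the wrapping could be as simple as something like
this (a sketch only; 'active_action', 'begin_action', and friends are
assumptions, not Contextual's real API):

    from contextlib import contextmanager

    @contextmanager
    def ensure_action(backend):
        """Wrap a backend call in an Action-like scope if one isn't
        already active, so a save can also do a db-level commit."""
        if backend.active_action is not None:
            yield backend.active_action        # reuse the enclosing scope
            return
        action = backend.begin_action()        # temporary, local scope
        try:
            yield action
        except:
            action.abort()
            raise
        else:
            action.commit()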
Implementation Strategy
-----------------------
So at this level, there's a big ol' hairy pile of things left to get
specified. We haven't talked schema yet, let alone the detailed
mechanics of mapping between the schema and the backends. We don't
know whether caches can be GC'd properly or not. We don't have
change conflict prevention (so that updates to the DB that happen
during a long edit aren't lost). Heck, we don't even have a way to
ensure that active queries are updated when somebody changes the
database underneath us.
On the other hand, a few things *are* clear:
* Retrieving, editing, and saving will be explicit operations
* We will need lazy cells and lazy data structures
* It's possible to make a purely query-based validation system
The first of these points is *very* useful, because it means that one
can start writing code using services that just explicitly find,
edit, and delete objects, using direct database mapping code, without
waiting for TrellisDB as a whole to land.
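That is, a storage service written today could be as plain as the
following sketch (hypothetical; it uses sqlite3 directly and returns
plain dicts just to stay self-contained):

    import sqlite3

    class ContactService(object):
        """Hand-written storage service: explicit find/save/delete,
        with direct database mapping code and no generic schema layer."""

        def __init__(self, path):
            self.db = sqlite3.connect(path)

        def find(self, contact_id):
            row = self.db.execute(
                "SELECT id, name, email FROM contacts WHERE id=?",
                (contact_id,)).fetchone()
            if row is None:
                return None
            return dict(id=row[0], name=row[1], email=row[2])

        def save(self, contact):
            self.db.execute(
                "UPDATE contacts SET name=?, email=? WHERE id=?",
                (contact['name'], contact['email'], contact['id']))
            self.db.commit()

        def delete(self, contact):
            self.db.execute(
                "DELETE FROM contacts WHERE id=?", (contact['id'],))
            self.db.commit()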
Instead, TrellisDB can provide basic virtual data types for sets,
indexed sets (i.e., lists) and other needed data structures. Then,
specific storage services (akin to old-style PEAK DataManagers) can
simply use instances of these basic types, or specialized subclasses,
in order to act as "backends" implementing domain-specific functionality.
This will allow us to get a feel for what services would actually be
useful in a fully-generic "database" service and schema subsystem.
So, the basic strategy will be to implement base set types for things
like union, intersect, join, difference, filter, map, "splitters",
and so on. These pieces can then be combined using domain-specific
(and schema-specific) code, until the schema subsystem and query
language are in place. There will probably also be some support
services based on Contextual; if nothing else, some sort of base
class for persistence services.
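To give a flavor of what those base types might look like (names invented
for illustration; real versions would also generate change events):

    class FilterSet(object):
        """Hypothetical virtual set: lazily presents the members of another
        (possibly virtual) set that satisfy a predicate."""

        def __init__(self, source, predicate):
            self.source = source
            self.predicate = predicate

        def __iter__(self):
            return (item for item in self.source if self.predicate(item))

        def __contains__(self, item):
            return self.predicate(item) and item in self.source

    class UnionSet(object):
        """Hypothetical virtual set: the union of several other sets."""

        def __init__(self, *sources):
            self.sources = sources

        def __iter__(self):
            seen = set()
            for source in self.sources:
                for item in source:
                    if item not in seen:
                        seen.add(item)
                        yield item

        def __contains__(self, item):
            return any(item in source for source in self.sources)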