[PEAK] Laziness, validation, and other TrellisDB thoughts
Phillip J. Eby
pje at telecommunity.com
Tue Aug 21 20:47:49 EDT 2007
So, it's long past time for me to get down to brass tacks on the
TrellisDB design. I've had a lot of general thoughts, but a lot of
those thoughts need nailing down or tossing out.
First off, I've been split between a mental model based on old-style
schema APIs (like PEAK's peak.model and the Chandler application.schema
system), and one that simply uses trellis.Component objects to model entities.
From a purely conceptual standpoint, using trellis.Component instead
of a specialized persistence type is an enormous win. You can have a
"pure" domain model component, and manipulate it without even needing
a place to store it, let alone a schema for storing it. You can
manage schema information in a way that's 100% independent (although
you still need a schema, and some way to know what back-end is in use).
From a practical perspective, however, there is a breakdown in two
areas: laziness and validation.
Laziness
--------
Let's tackle laziness first. When you access an object, you may not
need all of its data. For example, if you're displaying a list of
emails, you don't want to have to load the entire text of each one
off of the disk.
Second, and even more critical, is that any attributes that are links
to other objects *must* be lazy, because otherwise accessing any
object will end up loading the entire graph of all related objects into memory!
So, it doesn't suffice to just slap ordinary cells into a component's
__cells__ in order to load a component from a storage back-end. For
multi-valued attributes (such as sets, sequences, or mappings), the
actual data structure can be lazy, so they're not too big a deal.
It's the single-valued attributes, such as a blob field or a
reference to a single related object, that are tricky. The standard
trellis Cell types won't work, as they will automatically calculate
their value after creation. And making the attributes optional
doesn't work, because the component type doesn't know (and doesn't
*want* to know) how to look up the missing field(s) in a database. Not good.
So, we're going to have to have an explicitly "lazy" cell type, or at
least a way to make a cell with deferred initialization. Or perhaps
just an object that has the same interface as a cell, but doesn't
cache a value and instead delegates its operations to something else,
just like the attributes based on lists or dictionaries or sets.
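As a rough illustration of the deferred-initialization flavor of that idea
(not an actual Trellis API; the class and the 'loader' callback are
hypothetical), something along these lines would do:

    class DeferredCell(object):
        """Hypothetical cell-like object whose value isn't computed at
        creation time; 'loader' is any zero-argument callable (say, a
        closure over a row id and a backend) that fetches the real value
        on first use."""

        _missing = object()           # sentinel meaning "not loaded yet"

        def __init__(self, loader):
            self.loader = loader
            self._value = self._missing

        def _get(self):
            if self._value is self._missing:
                self._value = self.loader()   # hit the backend only now
            return self._value

        def _set(self, value):
            self._value = value

        value = property(_get, _set)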
This, by the way, is why my Spike prototype for Chandler internally
represented *all* attributes as sets, and simply had descriptors that
presented single-valued attributes by retrieving the one value from
the underlying set. That way, a backend need only provide virtual
sets in order to implement any attribute.
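A sketch of that trick (not Spike's actual code; '_sets' is an invented
per-object mapping from attribute names to set-like objects):

    class SingleValue(object):
        """Descriptor presenting a one-element (virtual) set as a plain
        attribute, so the backend only ever has to supply sets."""

        def __init__(self, name):
            self.name = name

        def __get__(self, obj, typ=None):
            if obj is None:
                return self
            for v in obj._sets[self.name]:    # return the lone member, if any
                return v
            return None

        def __set__(self, obj, value):
            values = obj._sets[self.name]
            values.clear()
            if value is not None:
                values.add(value)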
So, we could actually do the same thing in TrellisDB, just using a
cell-like object that wraps a set. But it doesn't really matter; the
point is just that a TrellisDB backend is responsible for providing
cells (or cell-like objects) and slapping them all into a Component.
Well, that's not precisely correct. A backend is responsible for
providing cells (or lists, sets, etc.) that it's asked for. A
front-end (middle end?) layer will be responsible for putting it all
into a Component, and for asking backend parts to load their respective data.
Well, load isn't really the right word. The backends will just be
asked for objects that *will* do the loading, when they're required to.
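Roughly speaking, the division of labor might look like this (all the
names here are invented for illustration):

    def assemble(cls, key, backend):
        """Hypothetical front-end helper: ask the backend for cell-like
        objects (or lazy sets/lists/dicts) and slap them into a fresh
        Component.  Nothing here reads from storage; the returned objects
        do the actual loading later, if and when they're used."""
        obj = cls.__new__(cls)                  # skip normal initialization
        obj.__cells__ = dict(
            (name, backend.cell_for(cls, key, name))
            for name in backend.attribute_names(cls))
        return obj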
Validation
----------
In a pure Trellis world (i.e., without TrellisDB), the simplest way
to do validation is to just write a rule that lists all current
errors in an object's state -- all of its constraint violations, in
other words.
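For example, a minimal sketch of such a rule (the exact attribute and
rule decorators have varied across Trellis versions, so treat the
spellings as approximate):

    from peak.events import trellis

    class Contact(trellis.Component):
        name = trellis.attr('')     # attribute spellings approximate
        email = trellis.attr('')

        @trellis.compute            # or @trellis.rule in older versions
        def errors(self):
            """All of this object's current constraint violations."""
            errs = []
            if not self.name:
                errs.append("name is required")
            if '@' not in self.email:
                errs.append("email doesn't look valid")
            return errs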
In a larger system, however, this introduces some problems. First,
some validation rules aren't really object-level, like "every contact
must have an email address unique to that contact". Second, what if
a validation rule refers to a field that we want to load lazily?
One simple fix for the second issue is to just mark the validation
rule(s) @optional, since they are only really needed when editing an
object or trying to save it. However, once you've edited a given
object, its validation rules are going to hang around, and keep
referring to whatever data they used, unless there's some way to
garbage collect that. (Which we'll talk more about later).
For the first issue, we may want to define validation rules in terms
of *queries*, and use query-level notifications to determine whether
a violation has occurred. That is, if a change to the object causes
an addition to a set of "objects with errors", then the change is
responsible and should be rejected.
Interestingly, this same approach could be used for field-level
validation... but probably shouldn't. The big problem with
query-based validation is that it requires changes to be forwarded to
the DB service (even if those changes aren't actually written
out). That makes them impractical for use as live notifications
during editing, and suitable only for rejecting save operations.
Actually, that's not really true. It's not necessary for changes to
be forwarded to the DB service. The system could simply create
virtual sets that bridge virtual records representing an object's
"current changes" (or "state-to-be") with actual DB state.
Or, alternately -- and this makes somewhat more sense -- we can
simply have the DB service keep track of what records are changed,
and modify query results accordingly. This would make a query-based
validation system work splendidly, at the cost of some issues with
cache coherency.
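A crude sketch of that bookkeeping (invented names; a real implementation
would push the filtering into the query machinery rather than scanning
every record):

    class TrackingDBService(object):
        """Hypothetical DB service that overlays unsaved changes on
        query results, so queries see the "state-to-be"."""

        def __init__(self, backend):
            self.backend = backend
            self.pending = {}            # key -> changed (unsaved) record

        def record_change(self, key, record):
            self.pending[key] = record

        def query(self, predicate):
            for key, record in self.backend.all_records():
                record = self.pending.get(key, record)
                if predicate(record):
                    yield record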
To see the cache coherency issue concretely: if you have a query -- say,
a count of tasks in various statuses -- and you modify the status of an
item, then the count has to be updated. Now, we can quite possibly make the
derived sets generate appropriate change events, and thereby cleanly
update the display. But suppose that for some reason we need to
iterate over the query's rows?
If the query is backed by a database, and the database isn't
up-to-date, it is going to produce an inconsistent result. The
counts returned will not reflect the change in status that hasn't
been written back to disk yet.
This strongly suggests that to do validation, one needs a different
"view" of the data, one that is isolated from the rest of the
system. Validation queries should explicitly merge the data being
edited with the current stable state, strictly within the current
editing context.
This implies, of course, that there is such a thing as an editing
context! We need to distinguish between an object that we're just
reading, and one that we intend to edit. This could be done at
"load" time (which isn't really load time) or by asking for editing
later. We would also need to be able to save or cancel the changes,
as well as get a set object that will hold validation data (and
provide validation events).
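In API terms, that might end up looking vaguely like this (an entirely
speculative sketch; none of these names exist yet):

    class EditingContext(object):
        """Hypothetical editing context, distinct from merely reading."""

        def __init__(self, db, target):
            self.db = db
            self.target = target          # the object being edited
            self.errors = set()           # to be fed by validation queries

        def save(self):
            if self.errors:
                raise ValueError("can't save with outstanding violations: %r"
                                 % sorted(self.errors))
            self.db.write_changes(self.target)   # short db-level transaction

        def cancel(self):
            self.db.discard_changes(self.target)

The spelling doesn't matter much; the point is that reading, editing, and
saving are distinguishable operations, and that the editing context owns
the validation data.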
O-O, Persistence, and Context
-----------------------------
So far, some bits of this don't sound very much like an object
persistence system. Personally, I think that's okay, as transparent
persistence is highly overrated. OO mixes code and data, which is a
bad thing if your data is expected to outlive the application.
In the context of long-lived data, OO mostly gives you API
convenience and little else. If you couple your objects to your
external data schema, you are just begging for the headache known as
schema evolution. That's why Chandler has EIM (external information
model) for storing data that's expected to outlive a given version of
the app, or to communicate with other systems (such as Chandler
Server, which is written in Java).
So, the TrellisDB approach is going to require that you explicitly
ask a DB to find you an object (or objects), explicitly begin making
changes to it, and explicitly save those changes. However, for
convenience, it's likely that the Contextual library will be used to
provide automatic tracking and saving, using "with:" blocks in Python
2.5. Contextual already has a notion of scopes, and resources that
are acquired within a given scope being automatically cleaned up or
released afterward.
The 'context.Action' scope in particular is intended for doing
transaction-like things, and should be adaptable to this
purpose. One of the intended functions of context.Action is to
support database connection pooling as well as transactions, so it's
likely that TrellisDB will be wanting to use it anyway.
But that brings an interesting issue into play, which is that an
event-driven GUI application is going to need the action scope to
hang around for perhaps some time. More properly, it needs the
editing scope to hang around, which I suppose isn't quite the same
thing. It also isn't going to play well with my connection pooling
plans for context.Action.
To try and put it more clearly: editing scope is not transaction
scope. Transactions want to be short-lived, but actual human editing
is slow. So editing context has to be long-lived, even if during the
process of validation it's necessary to temporarily grab a DB
connection and do a short-lived read-only transaction in order to
look up or validate something.
This means that TrellisDB backends will need to be able to
temporarily wrap themselves in an Action, if there isn't one active
in the context where they "live". If done correctly, this would also
allow saves to automatically do db-level commits.
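In plain Python terms, the wrapping could be as simple as something like
this (a sketch only; 'active_action', 'begin_action', and friends are
assumptions, not Contextual's real API):

    from contextlib import contextmanager

    @contextmanager
    def ensure_action(backend):
        """Wrap a backend call in an Action-like scope if one isn't
        already active, so a save can also do a db-level commit."""
        if backend.active_action is not None:
            yield backend.active_action        # reuse the enclosing scope
            return
        action = backend.begin_action()        # temporary, local scope
        try:
            yield action
        except:
            action.abort()
            raise
        else:
            action.commit()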
Implementation Strategy
-----------------------
So at this level, there's a big ol' hairy pile of things left to get
specified. We haven't talked schema yet, let alone the detailed
mechanics of mapping between the schema and the backends. We don't
know whether caches can be GC'd properly or not. We don't have
change conflict prevention (so that updates to the DB that happen
during a long edit aren't lost). Heck, we don't even have a way to
ensure that active queries are updated when somebody changes the
database underneath us.
On the other hand, a few things *are* clear:
* Retrieving, editing, and saving will be explicit operations
* We will need lazy cells and lazy data structures
* It's possible to make a purely query-based validation system
The first of these points is *very* useful, because it means that one
can start writing code using services that just explicitly find,
edit, and delete objects, using direct database mapping code, without
waiting for TrellisDB as a whole to land.
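That is, a storage service written today could be as plain as the
following sketch (hypothetical; it uses sqlite3 directly and returns
plain dicts just to stay self-contained):

    import sqlite3

    class ContactService(object):
        """Hand-written storage service: explicit find/save/delete,
        with direct database mapping code and no generic schema layer."""

        def __init__(self, path):
            self.db = sqlite3.connect(path)

        def find(self, contact_id):
            row = self.db.execute(
                "SELECT id, name, email FROM contacts WHERE id=?",
                (contact_id,)).fetchone()
            if row is None:
                return None
            return dict(id=row[0], name=row[1], email=row[2])

        def save(self, contact):
            self.db.execute(
                "UPDATE contacts SET name=?, email=? WHERE id=?",
                (contact['name'], contact['email'], contact['id']))
            self.db.commit()

        def delete(self, contact):
            self.db.execute(
                "DELETE FROM contacts WHERE id=?", (contact['id'],))
            self.db.commit()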
Instead, TrellisDB can provide basic virtual data types for sets,
indexed sets (i.e., lists) and other needed data structures. Then,
specific storage services (akin to old-style PEAK DataManagers) can
simply use instances of these basic types, or specialized subclasses,
in order to act as "backends" implementing domain-specific functionality.
This will allow us to get a feel for what services would actually be
useful in a fully-generic "database" service and schema subsystem.
So, the basic strategy will be to implement base set types for things
like union, intersect, join, difference, filter, map, "splitters",
and so on. These pieces can then be combined using domain-specific
(and schema-specific) code, until the schema subsystem and query
language are in place. There will probably also be some support
services based on Contextual; if nothing else, some sort of base
class for persistence services.
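To give a flavor of what those base types might look like (names invented
for illustration; real versions would also generate change events):

    class FilterSet(object):
        """Hypothetical virtual set: lazily presents the members of another
        (possibly virtual) set that satisfy a predicate."""

        def __init__(self, source, predicate):
            self.source = source
            self.predicate = predicate

        def __iter__(self):
            return (item for item in self.source if self.predicate(item))

        def __contains__(self, item):
            return self.predicate(item) and item in self.source

    class UnionSet(object):
        """Hypothetical virtual set: the union of several other sets."""

        def __init__(self, *sources):
            self.sources = sources

        def __iter__(self):
            seen = set()
            for source in self.sources:
                for item in source:
                    if item not in seen:
                        seen.add(item)
                        yield item

        def __contains__(self, item):
            return any(item in source for source in self.sources)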