[PEAK] Another way of looking at constraints and persistence

Phillip J. Eby pje at telecommunity.com
Fri Dec 10 00:46:01 EST 2004


[FYI: this is a long and rambling design post that doesn't end up 100% 
finalizing things.  But it *does* make significant progress towards nailing 
down the architecture for persistence and constraints in peak.schema and the 
"workspace" model of transaction handling.  However, if you want to know 
what the final API will look like, it's not here yet. There will be at 
least one more post to write before I can get it fleshed out that far.]

The other day, I was going to implement my first-draft concept for 
peak.schema attributes and workspaces, when I had an interesting thought.

As I was mentally working out implementation details of the schema 
attribute objects, it occurred to me that there was really no difference 
between the ones I was envisioning, and the attribute descriptors that 
already exist in peak.binding, except for schema validation and event 
notification (for e.g. persistence and GUIs).

So, it then occurred to me, why not just treat schema information as pure 
constraints, entirely distinct from attribute 
implementation?  Interestingly, this would mean that schema validation 
could be applied to arbitrary components as well.

But how to implement this?  Well, as I thought about it, it occurred to me 
that since all attribute changes pass through __setattr__, it would be 
sufficient to have a __setattr__ that checks the attribute being set 
against the target object's schema.  This could be put in a 
'schema.Checked' class, and it could just use an attribute metadata 
registry to look up the constraints.  Further, 'Checked' objects could even 
prohibit setting of attributes not defined by their schema.

Now, this doesn't address changes to mutable attributes like 
lists.  However, if constraints can manipulate the value being set, they 
could replace a list being set on an attribute with a CheckedList that 
validates further changes.
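
Just as a sketch of what such a replacement might look like (the 
CheckedList name and the item_check callable here are made up for 
illustration, not part of peak.schema):

     class CheckedList(list):
         """A list that validates/transforms every item added to it."""

         def __init__(self, items=(), item_check=lambda v: v):
             self.item_check = item_check
             super(CheckedList, self).__init__(item_check(v) for v in items)

         def append(self, value):
             super(CheckedList, self).append(self.item_check(value))

         def insert(self, index, value):
             super(CheckedList, self).insert(index, self.item_check(value))

         def extend(self, iterable):
             super(CheckedList, self).extend(
                 self.item_check(v) for v in iterable)

         def __setitem__(self, index, value):
             # (other mutators like __iadd__ are omitted for brevity)
             if isinstance(index, slice):
                 value = [self.item_check(v) for v in value]
             else:
                 value = self.item_check(value)
             super(CheckedList, self).__setitem__(index, value)

A __setattr__ constraint could then hand back something like 
CheckedList(value, someItemConstraint.check) in place of a plain list.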

So that leaves event support and persistence.  For persistence, we need two 
things: first, to be notified of all changes to an object, and second, to 
be able to supply it with lazily-loaded attributes.  Well, as it happens, 
__setattr__ and either __getattr__ or __getattribute__ should suffice for 
these requirements, if the workspace creates a dynamic subclass of the 
target class, with custom versions of these methods whenever it loads or 
creates an instance.
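
For example, something along these lines might do it; the workspace 
methods used here (note_changed, load_attribute) are invented for the 
sake of the sketch:

     def persistent_subclass(cls, workspace):
         """Make a dynamic subclass whose attribute access reports to 'workspace'."""

         class _Persistent(cls):

             def __setattr__(self, attr, value):
                 super(_Persistent, self).__setattr__(attr, value)
                 workspace.note_changed(self, attr)      # mark dirty for writeback

             def __getattr__(self, attr):
                 # only reached when normal lookup fails: a chance to lazily
                 # load state (the loader should use object.__setattr__ to
                 # avoid re-triggering note_changed)
                 workspace.load_attribute(self, attr)
                 return object.__getattribute__(self, attr)

         _Persistent.__name__ = cls.__name__
         return _Persistent

A workspace would then do 'ob.__class__ = persistent_subclass(ob.__class__, 
self)' (or instantiate the subclass directly) whenever it loads or creates 
an instance.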

With these simple approaches, we can persist *any* object, without it 
having to be of a particular class or metaclass.  The only requirement is 
that it have an assignable __class__, which means we can't directly persist 
built-in types.  By the same token, we can also apply constraints to any 
object type (that has an assignable __class__), orthogonally to whether the 
type is persisted or not, whether it's a Component or not, or really 
anything else about it.
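
That restriction is easy to see interactively:

     class Plain(object):
         pass

     class Tracked(Plain):
         pass

     p = Plain()
     p.__class__ = Tracked       # fine: ordinary instances accept __class__ assignment

     try:
         [].__class__ = list     # built-in instances do not
     except TypeError:
         print("can't reassign __class__ on a built-in type's instance")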

The secret ingredient here is metadata.  Before we had the metadata 
machinery, it would be a royal pain to generate a custom subclass to do 
constraint checking, because where would we get the constraints 
from?  Metadata was tied directly to implementation, so there was no choice 
in the matter.  But now, we can write something like:

     def checkObject(ob, callback=None):
         # look up the object-level constraints for 'ob' in the metadata
         # registry and check them in place
         return ObjectConstraints[ob].check(ob, callback, in_place=True)

     class Checked(object):

         def __init__(self, *__args, **__kw):
             super(Checked, self).__init__(*__args, **__kw)
             checkObject(self)   # validate the object once it's constructed

         def __setattr__(self, attr, value):
             constraint = AttributeConstraints[self, attr]
             value = constraint.check(value)      # check/transform the value
             old = getattr(self, attr, NOT_GIVEN) # old value, kept for eventual change notification
             super(Checked, self).__setattr__(attr, value)

And just mix 'Checked' into classes we want to always be checked.

In actual practice, though, the simple '.check()' method needs to be a bit 
more complex, or maybe there needs to be another key for the constraint 
registries.  Specifically, we need to know something about the context of 
the check, so that we can forgo checks that are not suitable (e.g. due to 
the need for a database query.)

On the other hand, it may be simple enough to say that there are 
attribute/item constraints, object constraints (i.e. involving more than 
one attribute of an object, or more than one item of a sequence), and 
system constraints (involving properties of the fact base as a whole).

But on the *other* other hand, I can see some constraints that are 
technically object constraints, but which should not be run in a "naive" 
way.  That is, they shouldn't just loop over a set of related objects in 
order to check some sequence-wide property, but instead should do a 
database query if they're going to be checked at all.  I can also see that 
some "system constraints" should not be checked on every transaction, but 
rather checked periodically by running queries to find exceptions.

So, for simple things, this three-way division is perfect, but in some 
cases it leaves something to be desired, because knowledge of when to check 
the constraint is sometimes based on implementation knowledge rather than 
domain knowledge.  Also, the actions to be taken when a constraint fails 
can differ based on application-specific criteria.  Although I've alluded 
in my code sample to having a callback for failing constraints (that 
defaults to raising an error), I don't think this is enough to accommodate 
all of the use cases.

I think that the appropriate separation, then, is still to categorize 
constraints in the three levels, but to use a separate generic function to 
actually check the constraints.  That is, instead of calling 
'aConstraint.check(aValue)', one would call a generic function like 
'aWorkspace.checkAttribute(aConstraint, aValue)' or 
'aWorkspace.checkWhileCommitting(aConstraint, aValue)', or something of 
that sort.  This would allow "implementation-specific" overrides as to how 
or whether a particular constraint should be checked.
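
As a rough illustration of the shape of that (plain methods standing in 
for whatever generic-function machinery actually gets used, and the 
deferral bookkeeping entirely made up):

     class Workspace(object):
         """Decides how, and whether, a given constraint actually gets run."""

         def __init__(self):
             self.deferred = []      # (constraint, value) pairs awaiting commit

         def checkAttribute(self, constraint, value):
             # default policy: run the check right away, returning the
             # (possibly transformed) value
             return constraint.check(value)

         def checkWhileCommitting(self, constraint, value):
             # default policy: remember the check and run it at commit time
             self.deferred.append((constraint, value))

         def commit(self):
             for constraint, value in self.deferred:
                 constraint.check(value)
             del self.deferred[:]

A subclass (or a generic-function override) could then decline to run a 
particular constraint immediately, e.g. because it would require a 
database query.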

That part of the concept still needs a bit more work, but could possibly do 
the trick.  In truth, system constraints aren't going to be checked on 
individual objects anyway, so even though we need a way to express them 
eventually, we don't need to worry about it for this stage of the 
design.  So that leaves us only with attribute and object constraints, 
which are different only in terms of what constraints you're likely to 
apply, and what objects they apply to.

So, the only real issue is, what constraint checks can and should be 
deferred for performance reasons?  Well, by their very nature, it would be 
checks that apply to global properties of a sequence or mapping.

Why is that?  Well, because an object's attribute schema is fixed, and 
therefore bounded.  So, only checks that require examining all the items in 
a sequence or mapping are capable of consuming large quantities of time, 
and should therefore be deferred.

Of course, in principle it's possible for there to be complex calculations 
that might take a long time, like say validating that an attribute is the 
Nth prime number for some large value of N.  :)  So, we don't *actually* 
have to limit deferral to global properties of collections.  But, it does 
mean that we actually can allow the constraints themselves to decide 
whether they should be deferred for batch processing.  With an appropriate 
mechanism, we could actually allow the constraints to add themselves and 
the object to be checked to a deferral map of some kind, so that batch 
processing need only check those constraints that haven't already been checked.

About the only scenario I can't seem to figure out is validating complex 
constraints in the background for e.g. a GUI application, since that leads 
to all sorts of "interesting" issues with threads, transaction conflict, 
and the like.  It seems like you'd have to have an explicit savepoint 
during which the user couldn't edit things.  On the other hand, maybe 
there's a simple way to just defer applying GUI changes until the 
background check completes.  That way, the GUI can be responsive, but it 
stops giving immediate validation while a batch validation is taking 
place.  As soon as batch validation finishes, the foreground changes can be 
applied, and foreground validation can resume.

Wow.  Pretty slick.  In fact, I think I know how to handle the deferral map 
thingy now too.  A constraint that wants to defer can call a method on the 
checked object's workspace.  The workspace can then either stick the 
constraint in a deferral map, or process it immediately, depending on its 
choice.  For example, an in-memory workspace (like the default workspace) 
would process it immediately.  More sophisticated workspaces could actually 
use a generic function to decide what to do with the thing.    For example, 
a relational workspace might segregate deferred constraints into those that 
must be run before DB writebacks occur, and those that can't be run unless 
the writebacks have already taken place.  But our basic implementations for 
now would just either run immediately, or defer until commit time.
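
In very rough outline, with all the names (defer, needs_writeback, and so 
on) invented for the example, and assuming constraints and objects are 
hashable so sets can serve as the deferral map:

     class InMemoryWorkspace(object):
         """Default workspace: nothing to write back, so check right away."""

         def defer(self, constraint, ob):
             constraint.check(ob)

     class RelationalWorkspace(object):
         """Splits deferred checks around the DB writeback phase."""

         def __init__(self):
             self.pre_writeback = set()   # must run before rows are written
             self.post_writeback = set()  # can only run after rows are written

         def defer(self, constraint, ob):
             # the sets double as the "deferral map": registering the same
             # (constraint, object) pair twice costs nothing extra
             target = (self.post_writeback
                       if getattr(constraint, 'needs_writeback', False)
                       else self.pre_writeback)
             target.add((constraint, ob))

         def commit(self):
             for constraint, ob in self.pre_writeback:
                 constraint.check(ob)
             self._write_changes()        # placeholder for the actual writeback
             for constraint, ob in self.post_writeback:
                 constraint.check(ob)
             self.pre_writeback.clear()
             self.post_writeback.clear()

         def _write_changes(self):
             pass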

I can hardly believe it's this simple, after literally years of thinking 
about how to do this "right".

Hm.  I think the actual implementation is going to need to delegate checks 
to the workspace from the get-go, rather than the way my draft above just 
checks things left and right.  I just remembered that for a web 
application, a web form is going to want to go ahead and make the changes 
to the object, and then create a list of errors/problems in order to 
annotate the result form.  Hmm.  On the *other* other other hand, probably 
all that's needed is a two-pass process...

Wait a sec.  I'm making this way too complicated.  ALL "object constraints" 
are deferred, at the __setattr__ level.  I mean, a constraint that involves 
more than one attribute of an object has to be deferred unless you always 
change one attribute before the other, and can do so in a way that the 
change is still valid.  So, in order to validate changes you make to an 
object, you *have* to decide when to check the object constraints, no 
matter what you do.  At least, you must check when you are asserting a 
stable state for the object.

Immutables reach a stable state when __init__ finishes, so they can be 
checked then and need never be checked again.  For mutable objects, you only 
know you're "stable" when you either explicitly request a check, or you're 
committing a transaction.  Thus, explicit checking is best.  That check can 
include attribute-level checks, in fact, so the only reason to do 
attribute-level checks in __setattr__ is to get immediate error feedback.

An immediate error, however, *sounds* like a good idea, but in practice 
it's actually not that useful unless the error is a *programming* error, 
which it rarely will be.  Usually the source of error is user input, so you 
really need to just validate the user's input, and give them (non-error) 
feedback.  With the constraint registry in place, it's trivial to call a 
function to do a check on the value that you were "going to" set, and it 
makes a lot of sense to do so.  If you try to validate a non-existent 
attribute, you'll still get an error, since that *is* a programming error.

So, a UI would do two passes through the data.  First, it pre-validates the 
attribute values it intends to set.  If all the attributes pass 
individually, it then sets them all, and runs a check on the object as a 
whole to find any higher-level issues.
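
In pseudo-code, with checkObject and the AttributeConstraints registry 
borrowed from the draft above, and a hypothetical ConstraintError assumed 
for attribute-check failures:

     def apply_form(ob, form_data):
         """Two-pass form update: pre-validate values, then set and re-check."""
         errors = {}

         # pass 1: validate each intended value without touching the object
         for attr, value in list(form_data.items()):
             try:
                 form_data[attr] = AttributeConstraints[ob, attr].check(value)
             except ConstraintError as e:
                 errors[attr] = e

         if errors:
             return errors       # annotate the form; don't change anything yet

         # pass 2: apply the values, then run the object-level check,
         # collecting (rather than raising on) any higher-level problems
         object_problems = []
         for attr, value in form_data.items():
             setattr(ob, attr, value)
         checkObject(ob, callback=object_problems.append)
         if object_problems:
             errors[None] = object_problems
         return errors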

Meanwhile, a Checked object needs to keep track of whether it's "good", 
"bad", or "unknown".  After __init__, the state is "good".  If an 
object-level check fails, it becomes "bad".  If any attribute of the object 
changes, the state is "unknown".  If an object is in the "unknown" state at 
commit, it has to be checked.  If it's "good", or becomes "good" after a 
check, it can be committed.  If it's "bad" or becomes "bad" after the 
check, then the commit fails.  An object can still be "good" even if it has 
constraints that are deferred until DB writeback, since those constraints 
don't affect the writeback phase.
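
A sketch of that bookkeeping, layered on the earlier Checked draft (the 
'_state' attribute and the assumption that checkObject returns a truth 
value are mine, not settled API):

     class Checked(object):

         def __init__(self, *__args, **__kw):
             super(Checked, self).__init__(*__args, **__kw)
             checkObject(self)
             self.__dict__['_state'] = 'good'     # just passed its initial check

         def __setattr__(self, attr, value):
             constraint = AttributeConstraints[self, attr]
             super(Checked, self).__setattr__(attr, constraint.check(value))
             self.__dict__['_state'] = 'unknown'  # object-level checks are now stale

         def _recheck(self, callback=None):
             ok = checkObject(self, callback)
             self.__dict__['_state'] = 'good' if ok else 'bad'
             return ok

         def _readyToCommit(self):
             if self._state == 'unknown':
                 self._recheck()
             return self._state == 'good'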

Interestingly, these characteristics mean that when persisting an object, 
we would want to mix in Checked, because then we can track an object's 
"dirty" state for purposes of doing writebacks.

So let me see if I can simplify and summarize:

* 'Checked' objects check attribute constraints at attribute set/delete time
* A given constraint can request the target object's workspace to defer 
checking it until later
* The workspace can go ahead and check immediately, or wait
* An explicit check can request that even deferred constraints should be 
checked immediately
* Inter-attribute constraints are only checked at an explicit check point, 
like transaction commit
* 'Checked' objects know whether they are currently "good", "bad", or 
"unknown", and only check object-wide constraints when they are "bad" or 
"unknown", or the check is supposed to immediately check deferred constraints.
* All checking operations (other than simple __setattr__ constraints) 
accept a callback for reporting constraint failures, so that UI tools can 
gather up a collection of all known errors, not just the first one found.
* Workspaces initiate non-deferred constraint checks prior to DB writeback, 
and then perform any checks that were deferred.
* Changing an object's workspace means that the new workspace must be able 
to treat the object as if its state were "unknown", even if it's considered 
"good".  (Because the previous workspace might have deferred checks 
registered for the object, that the new workspace doesn't know about.)
* When a transaction is aborted, dirty objects must have their classes set 
to a subclass that rejects all attempts at use, and all their state should 
be cleared.  (This is just a persistence thing, not related to 
constraints.  But it means that the error produced by attempting to commit 
with invalid data needs to contain plenty of information about the issue, 
for debugging purposes.)

Well, I think that about sums it up.  This constraint architecture allows 
immediate, deferred, and partially deferred constraints on objects and on 
their individual attributes or collection items, while allowing constraint 
failure information to be accumulated for immediate or batch display to a 
user.  And, with a bit more tweaking, it could even distinguish between 
levels of severity in violations.  That is, you could have "warnings" or 
"advisories", distinct from actual "errors".  More precisely, we could have 
Condition objects (per my previous Condition/Resolution post) that are 
attached to constraints, and get passed to the workspace for resolution.

Ah, yes.  That's actually much better.  A 'Checked' object can just keep 
track of its Conditions, or lack thereof, and also forward them to its 
workspace whenever a new Condition is added, such as a Condition meaning, 
"I'm changed and need to be written to the DB", or "I'm changed and haven't 
had a full check run on me yet".

Now, instead of running an explicit "check" per se, one actually just 
requests resolution of the object's current conditions.  Per-attribute 
constraints are different, in that they are always checked immediately on 
__setattr__, and any error Conditions are resolved by immediately raising 
an error.  Or, more precisely, there are attribute pre-checks/transformers 
that validate the value before the attribute is set, and then there are 
Conditions that get set once the attribute has actually changed, like the 
"This object needs full validation" Condition.

Perfect!  The two-pass UI update algorithm still works with this, but now 
we can also apply the full dynamics of the Condition/Resolution concept to 
allow triggering any kind of business rules or "agents" when objects are 
changed.  We can even let domain-model methods create Conditions to 
indicate domain-level events of various kinds.  Those Conditions can then 
be trapped by workspace rules to produce UI effects, notifications, etc.

Of course, this now means we have to have a more detailed design of the 
Condition/Resolution system, which is still on the vague side.  Mainly, 
that means defining what principal kinds of conditions we need, because of 
course you can add whatever kinds you want later.  But we need to settle on 
basic kinds that cover the principal use cases.  Also, I need to flesh out 
the various (generic) methods on a workspace that will receive Condition 
notifications, the API for resolving Conditions, and so on.

I won't be doing that tonight, though, as I've spent the last four hours 
writing this and I need to eat something now.  :)



