[PEAK] Another way of looking at constraints and persistence
Phillip J. Eby
pje at telecommunity.com
Fri Dec 10 00:46:01 EST 2004
[FYI: this is a long and rambling design post that ends up not 100%
finalizing things. But it *does* make significant progress towards pinning
down the architecture for persistence and constraints in peak.schema and the
"workspace" model of transaction handling. However, if you want to know
what the final API will look like, it's not here yet. There will be at
least one more post to write before I can get it fleshed out that far.]
The other day, I was going to implement my first-draft concept for
peak.schema attributes and workspaces, when I had an interesting thought.
As I was mentally working out implementation details of the schema
attribute objects, it occurred to me that there was really no difference
between the ones I was envisioning, and the attribute descriptors that
already exist in peak.binding, except for schema validation and event
notification (for e.g. persistence and GUIs).
So, it then occurred to me, why not just treat schema information as pure
constraints, entirely distinct from attribute
implementation? Interestingly, this would mean that schema validation
could be applied to arbitrary components as well.
But how to implement this? Well, as I thought about it, it occurred to me
that since all attribute changes pass through __setattr__, it would be
sufficient to have a __setattr__ that checks the attribute being set
against the target object's schema. This could be put in a
'schema.Checked' class, and it could just use an attribute metadata
registry to look up the constraints. Further, 'Checked' objects could even
prohibit setting of attributes not defined by their schema.
Now, this doesn't address changes to mutable attributes like
lists. However, if constraints can manipulate the value being set, they
could replace a list being set on an attribute with a CheckedList that
validates further changes.
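To make the idea concrete, here's a minimal sketch of what such a CheckedList might look like. The names and the single-callable constraint interface are assumptions for illustration; only a couple of mutating operations are shown:

```python
class CheckedList(list):
    """A list that re-validates items on mutation (hypothetical sketch).

    'constraint' is assumed to be any callable that inspects one item
    and raises ValueError if it is invalid.  Only append/__setitem__
    are shown; a real version would cover all mutating methods.
    """
    def __init__(self, constraint, items=()):
        self._constraint = constraint
        super().__init__(self._check(i) for i in items)

    def _check(self, item):
        self._constraint(item)   # raises on an invalid item
        return item

    def append(self, item):
        super().append(self._check(item))

    def __setitem__(self, index, item):
        super().__setitem__(index, self._check(item))
```

A 'Checked' object's __setattr__ would then swap in a CheckedList whenever a plain list is assigned to a constrained attribute, so later in-place changes stay validated.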
So that leaves event support and persistence. For persistence, we need two
things: first, to be notified of all changes to an object, and second, to
be able to supply it with lazily-loaded attributes. Well, as it happens,
__setattr__ and either __getattr__ or __getattribute__ should suffice for
these requirements, if the workspace creates a dynamic subclass of the
target class, with custom versions of these methods whenever it loads or
creates an instance.
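A sketch of the dynamic-subclass trick, showing only the write path (the names Workspace, note_change, and adopt are assumptions; a real version would also supply __getattr__ or __getattribute__ for lazy loading):

```python
def make_persistent_class(cls, workspace):
    """Build a dynamic subclass of 'cls' whose __setattr__ notifies
    the workspace of every change (hypothetical sketch)."""
    def __setattr__(self, attr, value):
        workspace.note_change(self, attr, value)   # record the change
        super(subclass, self).__setattr__(attr, value)
    subclass = type('Persistent' + cls.__name__, (cls,),
                    {'__setattr__': __setattr__})
    return subclass

class Workspace:
    def __init__(self):
        self.dirty = []
    def note_change(self, ob, attr, value):
        self.dirty.append((ob, attr, value))
    def adopt(self, ob):
        # requires an assignable __class__, as noted above
        ob.__class__ = make_persistent_class(ob.__class__, self)
        return ob
```

Note that 'adopt' works on an instance of any ordinary class; the original class is untouched, which is the whole point of doing this orthogonally.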
With these simple approaches, we can persist *any* object, without it
having to be of a particular class or metaclass. The only requirement is
that it have an assignable __class__, which means we can't directly persist
built-in types. By the same token, we can also apply constraints to any
object type (that has an assignable __class__), orthogonally to whether the
type is persisted or not, whether it's a Component or not, or really
anything else about it.
The secret ingredient here is metadata. Before we had the metadata
machinery, it would be a royal pain to generate a custom subclass to do
constraint checking, because where would we get the constraints
from? Metadata was tied directly to implementation, so there was no choice
in the matter. But now, we can write something like:
    def checkObject(ob, callback=None):
        return ObjectConstraints[ob].check(ob, callback, in_place=True)

    class Checked(object):

        def __init__(self, *__args, **__kw):
            super(Checked, self).__init__(*__args, **__kw)
            checkObject(self)

        def __setattr__(self, attr, value):
            constraint = AttributeConstraints[self, attr]
            value = constraint.check(value)   # check/transform value
            old = getattr(self, attr, NOT_GIVEN)   # prior value, for event notification
            super(Checked, self).__setattr__(attr, value)
And just mix 'Checked' into classes we want to always be checked.
In actual practice, though, the simple '.check()' method needs to be a bit
more complex, or maybe there needs to be another key for the constraint
registries. Specifically, we need to know something about the context of
the check, so that we can forgo checks that are not suitable (e.g. due to
the need for a database query).
On the other hand, it may be simple enough to say that there are
attribute/item constraints, object constraints (i.e. involving more than
one attribute of an object, or more than one item of a sequence), and
system constraints (involving properties of the fact base as a whole).
But on the *other* other hand, I can see some constraints that are
technically object constraints, but which should not be run in a "naive"
way. That is, they shouldn't just loop over a set of related objects in
order to check some sequence-wide property, but instead should do a
database query if they're going to be checked at all. I can also see that
some "system constraints" should not be checked on every transaction, but
rather checked periodically by running queries to find exceptions.
So, for simple things, this three-way division is perfect, but in some
cases it leaves something to be desired, because knowledge of when to check
the constraint is sometimes based on implementation knowledge rather than
domain knowledge. Also, the actions to be taken when a constraint fails
can differ based on application-specific criteria. Although I've alluded
in my code sample to having a callback for failing constraints (that
defaults to raising an error), I don't think this is enough to accommodate
all of the use cases.
I think that the appropriate separation, then, is still to categorize
constraints in the three levels, but to use a separate generic function to
actually check the constraints. That is, instead of calling
'aConstraint.check(aValue)', one would call a generic function like
'aWorkspace.checkAttribute(aConstraint, aValue)' or
'aWorkspace.checkWhileCommitting(aConstraint, aValue)', or something of
that sort. This would allow "implementation-specific" overrides as to how
or whether a particular constraint should be checked.
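As a rough illustration of what such an override might look like, here is a sketch using ordinary method overriding in place of PEAK's generic functions. All names (Constraint, check_attribute, the 'expensive' hint) are assumptions:

```python
class Constraint:
    """A constraint carrying a hint about how costly it is to check
    (hypothetical sketch)."""
    def __init__(self, predicate, expensive=False):
        self.predicate = predicate     # callable returning True/False
        self.expensive = expensive     # e.g. requires a database query

    def check(self, value):
        return self.predicate(value)

class Workspace:
    def check_attribute(self, constraint, value):
        # default policy: check everything immediately
        return constraint.check(value)

class RelationalWorkspace(Workspace):
    """Implementation-specific override: defer expensive checks."""
    def __init__(self):
        self.deferred = []

    def check_attribute(self, constraint, value):
        if constraint.expensive:
            self.deferred.append((constraint, value))
            return True    # optimistically accept; checked later
        return constraint.check(value)
```

The point is that the decision of how (or whether) to check lives on the workspace, not on the constraint, so different back ends can make different trade-offs.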
That part of the concept still needs a bit more work, but could possibly do
the trick. In truth, system constraints aren't going to be checked on
individual objects anyway, so even though we need a way to express them
eventually, we don't need to worry about it for this stage of the
design. So that leaves us only with attribute and object constraints,
which are different only in terms of what constraints you're likely to
apply, and what objects they apply to.
So, the only real issue is, what constraint checks can and should be
deferred for performance reasons? Well, by their very nature, it would be
checks that apply to global properties of a sequence or mapping.
Why is that? Well, because an object's attribute schema is fixed, and
therefore bounded. So, only checks that require examining all the items in
a sequence or mapping are capable of consuming large quantities of time,
and should therefore be deferred.
Of course, in principle it's possible for there to be complex calculations
that might take a long time, like say validating that an attribute is the
Nth prime number for some large value of N. :) So, we don't *actually*
have to limit deferral to global properties of collections. But, it does
mean that we actually can allow the constraints themselves to decide
whether they should be deferred for batch processing. With an appropriate
mechanism, we could actually allow the constraints to add themselves and
the object to be checked to a deferral map of some kind, so that batch
processing need only check those constraints that haven't already been checked.
About the only scenario I can't seem to figure out is validating complex
constraints in the background for e.g. a GUI application, since that leads
to all sorts of "interesting" issues with threads, transaction conflict,
and the like. It seems like you'd have to have an explicit savepoint
during which the user couldn't edit things. On the other hand, maybe
there's a simple way to just defer applying GUI changes until the
background check completes. That way, the GUI can be responsive, but it
stops giving immediate validation while a batch validation is taking
place. As soon as batch validation finishes, the foreground changes can be
applied, and foreground validation can resume.
Wow. Pretty slick. In fact, I think I know how to handle the deferral map
thingy now too. A constraint that wants to defer can call a method on the
checked object's workspace. The workspace can then either stick the
constraint in a deferral map, or process it immediately, depending on its
choice. For example, an in-memory workspace (like the default workspace)
would process it immediately. More sophisticated workspaces could actually
use a generic function to decide what to do with the thing. For example,
a relational workspace might segregate deferred constraints into those that
must be run before DB writebacks occur, and those that can't be run unless
the writebacks have already taken place. But our basic implementations for
now would just either run immediately, or defer until commit time.
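A minimal sketch of those two basic behaviors, with the deferral map also handling the de-duplication mentioned above (all names here are assumptions):

```python
class InMemoryWorkspace:
    """Default workspace: processes deferred checks immediately."""
    def defer_check(self, constraint, ob):
        constraint.check(ob)

class TransactionalWorkspace:
    """Defers checks until commit; the map is keyed per
    (constraint, object) pair, so repeat requests for the same pair
    are only checked once at commit time."""
    def __init__(self):
        self._deferral_map = {}

    def defer_check(self, constraint, ob):
        self._deferral_map[(id(constraint), id(ob))] = (constraint, ob)

    def commit(self):
        pending, self._deferral_map = self._deferral_map, {}
        for constraint, ob in pending.values():
            constraint.check(ob)
```

A constraint that wants deferral just calls 'defer_check' on its object's workspace and lets the workspace decide.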
I can hardly believe it's this simple, after literally years of thinking
about how to do this "right".
Hm. I think the actual implementation is going to need to delegate checks
to the workspace from the get-go, rather than the way my draft above just
checks things left and right. I just remembered that for a web
application, a web form is going to want to go ahead and make the changes
to the object, and then create a list of errors/problems in order to
annotate the result form. Hmm. On the *other* other other hand, probably
all that's needed is a two-pass process...
Wait a sec. I'm making this way too complicated. ALL "object constraints"
are deferred, at the __setattr__ level. I mean, a constraint that involves
more than one attribute of an object has to be deferred unless you always
change one attribute before the other, and can do so in a way that the
change is still valid. So, in order to validate changes you make to an
object, you *have* to decide when to check the object constraints, no
matter what you do. At least, you must check when you are asserting a
stable state for the object.
For immutables, they reach stable state when __init__ is finished, so they
can check then and need never check again. For mutable objects, you only
know you're "stable" when you either explicitly request a check, or you're
committing a transaction. Thus, explicit checking is best. That check can
include attribute-level checks, in fact, so the only reason to do
attribute-level checks in __setattr__ is to get immediate error feedback.
An immediate error, however, *sounds* like a good idea, but in practice
it's actually not that useful unless the error is a *programming* error,
which it rarely will be. Usually the source of error is user input, so you
really need to just validate the user's input, and give them (non-error)
feedback. With the constraint registry in place, it's trivial to call a
function to do a check on the value that you were "going to" set, and it
makes a lot of sense to do so. If you try to validate a non-existent
attribute, you'll still get an error, since that *is* a programming error.
So, a UI would do two passes through the data. First, it pre-validates the
attribute values it intends to set. If all the attributes pass
individually, it then sets them all, and runs a check on the object as a
whole to find any higher-level issues.
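The two-pass flow could look something like this sketch, where 'check_attribute' returns a problem description (or None) and 'check_object' returns a list of object-level problems; both names and signatures are assumptions:

```python
def apply_form(ob, form_data, check_attribute, check_object):
    """Two-pass UI update (hypothetical sketch).

    Pass 1 pre-validates each intended value without touching the
    object; pass 2 applies all values and runs the object-wide check.
    Returns a list of (attr, problem) / problem entries, empty on success.
    """
    errors = []
    # pass 1: per-attribute pre-checks, object left unchanged
    for attr, value in form_data.items():
        problem = check_attribute(ob, attr, value)
        if problem:
            errors.append((attr, problem))
    if errors:
        return errors
    # pass 2: apply everything, then check inter-attribute constraints
    for attr, value in form_data.items():
        setattr(ob, attr, value)
    errors.extend(check_object(ob))
    return errors
```

The returned list is exactly the "collection of all known errors" a form renderer needs for annotating the result page.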
Meanwhile, a Checked object needs to keep track of whether it's "Good",
"Bad", or "Unknown". After __init__, the state is "Good". If an
object-level check fails, it becomes "bad". If any attribute of the object
changes, the state is "unknown". If an object is in the "unknown" state at
commit, it has to be checked. If it's "good", or becomes "good" after a
check, it can be committed. If it's "bad" or becomes "bad" after the
check, then the commit fails. An object can still be "good" even if it has
constraints that are deferred until DB writeback, since those constraints
don't affect the writeback phase.
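The state-tracking rules above can be sketched as a tiny state machine; 'object_constraints', 'check', and 'committable' are assumed names for illustration:

```python
GOOD, BAD, UNKNOWN = 'good', 'bad', 'unknown'

class Checked:
    """Tracks validation state per the rules above (sketch).

    'object_constraints' is an assumed class-level list of callables
    taking the instance and returning True/False.
    """
    object_constraints = []

    def __init__(self, **attrs):
        # write through __dict__ so __init__ doesn't disturb the state
        for k, v in attrs.items():
            self.__dict__[k] = v
        self.__dict__['_state'] = GOOD      # "Good" after __init__

    def __setattr__(self, attr, value):
        super().__setattr__(attr, value)
        self.__dict__['_state'] = UNKNOWN   # any change invalidates

    def check(self):
        ok = all(c(self) for c in self.object_constraints)
        self.__dict__['_state'] = GOOD if ok else BAD
        return ok

    def committable(self):
        if self._state == UNKNOWN:
            self.check()
        return self._state == GOOD
```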
Interestingly, these characteristics mean that when persisting an object,
we would want to mix in Checked, because then we can track an object's
"dirty" state for purposes of doing writebacks.
So let me see if I can simplify and summarize:
* 'Checked' objects check attribute constraints at attribute set/delete time
* A given constraint can request the target object's workspace to defer
checking it until later
* The workspace can go ahead and check immediately, or wait
* An explicit check can request that even deferred constraints should be
checked immediately
* Inter-attribute constraints are only checked at an explicit check point,
like transaction commit
* 'Checked' objects know whether they are currently "good", "bad", or
"unknown", and only check object-wide constraints when they are "bad" or
"unknown", or the check is supposed to immediately check deferred constraints.
* All checking operations (other than simple __setattr__ constraints)
accept a callback for reporting constraint failures, so that UI tools can
gather up a collection of all known errors, not just the first one found.
* Workspaces initiate non-deferred constraint checks prior to DB writeback,
and then perform any checks that were deferred.
* Changing an object's workspace means that the new workspace must be able
to treat the object as if its state were "unknown", even if it's considered
"good". (Because the previous workspace might have deferred checks
registered for the object, that the new workspace doesn't know about.)
* When a transaction is aborted, dirty objects must have their classes set
to a subclass that rejects all attempts at use, and all their state should
be cleared. (This is just a persistence thing, not related to
constraints. But it means that the error produced by attempting to commit
with invalid data needs to contain plenty of information about the issue,
for debugging purposes.)
Well, I think that about sums it up. This constraint architecture allows
immediate, deferred, and partially deferred constraints on objects and on
their individual attributes or collection items, while allowing constraint
failure information to be accumulated for immediate or batch display to a
user. And, with a bit more tweaking, it could even distinguish between
levels of severity in violations. That is, you could have "warnings" or
"advisories", distinct from actual "errors". More precisely, we could have
Condition objects (per my previous Condition/Resolution post) that are
attached to constraints, and get passed to the workspace for resolution.
Ah, yes. That's actually much better. A 'Checked' object can just keep
track of its Conditions, or lack thereof, and also forward them to its
workspace whenever a new Condition is added, such as a Condition meaning,
"I'm changed and need to be written to the DB", or "I'm changed and haven't
had a full check run on me yet".
Now, instead of running an explicit "check" per se, one actually just
requests resolution of the object's current conditions. Per attribute
constraints are different, in that they are always checked immediately on
__setattr__, and any error Conditions are resolved by immediately raising
an error. Or, more precisely, there are attribute pre-checks/transformers
that validate the value before the attribute is set, and then there are
Conditions that get set once the attribute has actually changed, like the
"This object needs full validation" Condition.
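A rough sketch of that forwarding behavior, with every name (Condition, add_condition, condition_added, the two condition instances) an assumption made up for illustration:

```python
class Condition:
    """A named condition attached to an object (hypothetical sketch)."""
    def __init__(self, name):
        self.name = name

NEEDS_FULL_CHECK = Condition('needs-full-check')
NEEDS_WRITEBACK = Condition('needs-writeback')

class Checked:
    def __init__(self, workspace):
        # write through __dict__ so setup doesn't trip __setattr__
        self.__dict__['workspace'] = workspace
        self.__dict__['conditions'] = set()

    def add_condition(self, condition):
        if condition not in self.conditions:
            self.conditions.add(condition)
            # forward each new Condition to the workspace
            self.workspace.condition_added(self, condition)

    def __setattr__(self, attr, value):
        super().__setattr__(attr, value)
        self.add_condition(NEEDS_FULL_CHECK)
        self.add_condition(NEEDS_WRITEBACK)

class Workspace:
    def __init__(self):
        self.notifications = []
    def condition_added(self, ob, condition):
        self.notifications.append((ob, condition.name))
```

The workspace sees each Condition exactly once per object, and "requesting a check" becomes just "requesting resolution of the object's current conditions."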
Perfect! The two-pass UI update algorithm still works with this, but now
we can also apply the full dynamics of the Condition/Resolution concept to
allow triggering any kind of business rules or "agents" when objects are
changed. We can even let domain-model methods create Conditions to
indicate domain-level events of various kinds. Those Conditions can then
be trapped by workspace rules to produce UI effects, notifications, etc.
Of course, this now means we have to have a more detailed design of the
Condition/Resolution system, which is still on the vague side. Mainly,
that means defining what principal kinds of conditions we need, because of
course you can add whatever kinds you want later. But we need to settle on
basic kinds that cover the principal use cases. Also, I need to flesh out
the various (generic) methods on a workspace that will receive Condition
notifications, the API for resolving Conditions, and so on.
I won't be doing that tonight, though, as I've spent the last four hours
writing this and I need to eat something now. :)