[PEAK] Constraints as Condition/Resolution pairs
Phillip J. Eby
pje at telecommunity.com
Sat Nov 27 17:58:31 EST 2004
What's in a Constraint?
-----------------------
What's the difference between a peak.model object and a normal Python
object? Only two things, really: metadata and constraints. (Persistence
is actually a special case of constraints, as we'll see later.)
Now that we have a simple mechanism for defining metadata in orthogonal
ways, however, there's not much direct need for peak.model-style metadata
any more. That leaves constraints.
In its truest essence, a constraint is just a description of a desirable or
undesirable state of some collection of objects. A state or condition that
we would like to maintain -- or avoid.
In peak.model, some of these constraints are maintained
automatically. Specifically, those constraints that involve bidirectional
associations, multiplicity of collections, persistence, and so on. Other
constraints have to be implemented in code, often in a very complex
fashion. It's also difficult to implement delayed constraints of any sort,
as the basic model is, "raise an exception and reject the operation if a
constraint is violated."
So, I've been thinking about a new approach to constraints for peak.schema.
If we return to that essence of a constraint, we see that the constraint is
really just a kind of query; a query that matches either "good" or "bad"
states. So, in theory, we could just run queries for "bad" states (by
inverting "good" queries, if necessary), and then report on the results or
take appropriate action.
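Just to make that concrete, here's a tiny sketch in plain Python (none of this is peak.schema API; the names are invented for illustration). A constraint is written as a predicate describing the "good" state, and the "bad"-state query is simply its inversion applied to whatever collection of objects is in scope:

    class Account:
        def __init__(self, name, balance):
            self.name = name
            self.balance = balance

    def balance_is_valid(account):          # the "good" state
        return account.balance >= 0

    def find_violations(objects, good):     # query for the "bad" states
        return [ob for ob in objects if not good(ob)]

    accounts = [Account("a", 10), Account("b", -3)]
    assert [a.name for a in find_violations(accounts, balance_is_valid)] == ["b"]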
Of course, as a practical matter, it's silly to query an entire database
for invalid items when you can prevent them from becoming invalid. So, one
could limit the queries to the objects that are being changed within the
current transaction or subtransaction. For that matter, one could limit
the queries so that they only involve *attributes* that have been changed.
As we get more and more precise regarding what queries should actually be
run, we increase the complexity of the system in an attempt to gain
performance. The question is, is it premature (or at any rate, excessive)
optimization to get so picky?
In a client-server (e.g. web) application, the typical transaction is
either creating or updating an object using a full set of fields supplied
by the user. Every field has to be checked anyway, as does almost every
condition that crosses object boundaries. So, if we perform all the checks
at once in order to validate the input, we won't waste time trying to keep
track of which fields we changed and which ones we didn't.
In a GUI application, on the other hand, we have cycles to spare while we
wait for the slow human to type and move the mouse around -- not to mention
while they're busy interacting with other humans. We can just as easily
check all the fields whenever even one of them is changed. Granted, we
don't necessarily want to check inter-object constraints constantly,
especially if they involve database queries.
So, one thing we could do is look at queries in a hierarchy, or divide
constraints into various degrees of immediacy of checking. That might help
with the issue, but it takes us right back to making the application more
complex in order to squeeze out some performance.
Condition/Resolution
--------------------
Let's look at this another way. Assume that a system of objects is
initially in a state that satisfies all constraints. We can think of this
as the system being "in balance", like in double-entry bookkeeping. Any
action we take may potentially put the system "out of balance", requiring
some opposing action to bring the system back to its "correct" state. The
opposing action could be another change, or it could be a reversal of the
first change. It could even be an action like notifying someone or
recording the discrepancy in the database for later resolution. The point
is, once the system is out of balance, it must be rebalanced in order to
commit a transaction or subtransaction.
I call this concept, "Condition/Resolution". When you change an object,
conceptually this may give rise to a Condition indicating that one of its
constraints is violated. Or, maybe you make a change that causes an
existing Condition to be resolved. For example, consider a constraint that
says an inventory item's sale date must be after its purchase date. If you
are changing both, you might temporarily violate the constraint when you
change one field, then correct it when you change the second field.
So, any change must (conceptually) check whether the before state and the
after state give rise to a Condition, or the Resolution of that
condition. If any unresolved Conditions exist at commit time, the
transaction or subtransaction simply refuses to commit. (Of course, you
*could* transfer a Condition to the parent transaction of a subtransaction,
but that would technically be a Resolution of that condition, at least
within the subtransaction's scope.)
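In rough pseudo-API form (the Condition and Workspace names here are just illustrative stand-ins, not a finished design), the mechanism might look something like:

    class Condition:
        def __init__(self, description):
            self.description = description
            self.resolved = False

        def resolve(self):
            self.resolved = True

    class UnresolvedConditions(Exception):
        pass

    class Workspace:
        def __init__(self):
            self.conditions = []

        def add_condition(self, condition):
            self.conditions.append(condition)

        def unresolved(self):
            return [c for c in self.conditions if not c.resolved]

        def commit(self):
            pending = self.unresolved()
            if pending:
                raise UnresolvedConditions(
                    [c.description for c in pending])
            # ...otherwise, actually flush and commit here...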
When I first thought of this idea, I was thinking that you would actually
literally check constraints on every change, and create Condition objects
for each violated constraint. However, there is actually an even simpler
way to do it, which is just to generate Conditions meaning, "the state of
such-and-such a thing has not been checked". When it's time to resolve
conditions, these conditions can "self-resolve", by carrying out the
necessary check. If the check(s) pass, the condition is resolved. If they
fail, the condition generates new conditions describing the
failures. Either way, the original condition is now resolved.
Conditions representing constraint violations, of course, can't be
self-resolved in the general case. They simply accumulate in the
(sub)transaction and prevent it from committing. Application code must
iterate over the Conditions that can't self-resolve, and do something with
them.
Self-resolve is actually a bit of a misnomer, since the same constraint may
have different kinds of resolution required, depending on the
circumstance. So, in reality, there will probably be some sort of
'resolveCondition()' generic function associated with the workspace, to
allow context-specific resolution policies to be defined and automatically
applied for specific kinds of conditions. One of the default policies
would of course be, "resolve unchecked-constraint conditions by checking
the constraint(s) involved".
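As a sketch of such a policy (reusing the Condition and Workspace stand-ins from the sketch above, and with functools.singledispatch filling in for the dispatch package's predicate-based generic functions, so it only dispatches on condition type):

    from functools import singledispatch

    class UncheckedCondition(Condition):    # "so-and-so hasn't been checked"
        def __init__(self, target, checks):
            Condition.__init__(self, "unchecked: %r" % (target,))
            self.target = target
            self.checks = checks            # list of (name, predicate) pairs

    class ViolationCondition(Condition):    # "constraint so-and-so is violated"
        pass

    @singledispatch
    def resolveCondition(condition, workspace):
        pass    # default: leave it for application code to deal with

    @resolveCondition.register(UncheckedCondition)
    def _(condition, workspace):
        # Default policy: resolve "unchecked" conditions by running the
        # checks; failures spawn ViolationConditions, but the original
        # condition is resolved either way.
        for name, check in condition.checks:
            if not check(condition.target):
                workspace.add_condition(
                    ViolationCondition("violated: " + name))
        condition.resolve()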
Database Interaction
--------------------
Another kind of Condition that's relevant here is "this object has been
changed in memory but not written to the database". Or, "this object has
changed in a way that requires notifying a manager, but no email has been
sent yet". These are also constraints, even though they aren't the sort of
thing that initially comes to mind as a constraint. (Remember how I said
persistence was a special case of constraint maintenance?)
When we expand the scope in this way, however, it becomes clear that
transaction commit isn't the only action that may require conditions to be
resolved first. Running a database query while there are dirty objects in
memory is a no-no, so effectively the query needs to ask the workspace to
resolve such conditions. The actual writing of data to the database,
however, first requires any data-integrity conditions to be resolved.
This implies that there must be a way to query the workspace for active
conditions of some kind, or at least a way to check whether a particular
active condition affects the operation you are trying to perform.
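Something along these lines, perhaps (again, just a sketch building on the stand-in names above):

    def affects(condition, operation):
        # Placeholder policy; a real version would dispatch on both the
        # kind of condition and the kind of operation.
        return isinstance(condition, UncheckedCondition)

    def resolve_for(workspace, operation):
        # Ask the workspace to resolve whatever active conditions would
        # interfere with the operation we're about to perform.
        for condition in list(workspace.unresolved()):
            if affects(condition, operation):
                resolveCondition(condition, workspace)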
Hm. This seems to be one of those areas where the complexity just moves
around when you push on it, instead of getting any simpler. Let me restate
again.
In an ideal world, we could simply check all constraints, all the time, and
not worry about whether we were checking things we didn't change. In a
practical world, we need to avoid checking things that didn't change,
because we often can't afford to. (E.g. there are millions of unchanged
rows we'd be checking.)
The second part of the problem is that as soon as we need to keep track of
which things we changed, we can no longer rely on queries as a mechanism to
implement constraints. Certainly, we can't use database query facilities,
because we would have had to first write the objects to the database, which
we don't want to do until after we've checked the constraints!
So, the essential problem is that we must effectively perform queries that
may bridge persistent and non-persistent state. This is especially messy
with regard to cross-object constraints. Consider, for example, objects of
type A and B, with a constraint between them. We may have modified some
objects of each type, some related and some not. We cannot simply perform
a single query, but instead are forced to check the constraint from the
perspective of each individual A or B instance we have modified.
And this is where the complexity begins. To do this, we must know each
cross-object constraint that may affect objects of a particular type, and
we must know how to rewrite the constraint query such that it is relative
to that type. For example, if every A must have at least one B such that
B.x > A.x, then when changing some B's x we must check that the
corresponding A still has at least one B whose x exceeds A.x.
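In other words, something like this purely illustrative rewrite (the A and B classes and the 'bs' collection are made up for the example):

    class A:
        def __init__(self, x):
            self.x = x
            self.bs = []                    # related B instances

    class B:
        def __init__(self, a, x):
            self.a = a
            self.x = x
            a.bs.append(self)

    def check_after_changing_b(b):
        # From B's perspective: the corresponding A must still have at
        # least one related B whose x exceeds A.x.
        return any(other.x > b.a.x for other in b.a.bs)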
But constraint rewriting is just the beginning of our woes here, because we
also have to deal with the fact that we can't just query the database,
since items might be modified in memory but not updated in the database --
because we have to check the constraints first. For one-to-one
relationships, we could just reference objects directly, because our cache
mechanisms will trivially retrieve our in-memory versions of the related
objects. But, for one-to-many relationships, we will often need the
assistance of the database to do the "heavy lifting" of a query.
So, there are two forms of complexity here: 1) constraint rewriting, and 2)
N-way queries that must bridge persistent state and "dirty" state, where N
is too large to be practically loaded into memory.
Luckily, the latter is more of a problem in theory than in practice -- we
just have to defer such queries until the dirty data has been written to
the database, but before it is committed. The simple fact is that many
relational databases don't even *attempt* to solve the issue of such
complex constraints... so the database isn't going to reject our writing
incorrect data. Indeed, I believe it's not all that common for databases
to be actually defined with inter-*column* constraints, let alone
inter-*table* constraints (apart from referential integrity restrictions).
So, in essence, any constraint that is too complicated for us to handle in
Python is almost certainly going to be more complicated than the database
can handle, so we should just write to the database first, and ask
questions later. This does lead to some mild complications when working in
a GUI environment, to ensure that the database state gets rolled back once
all the constraints have been checked. But it's a lot simpler than the
alternative.
The nice thing about this is that it means inter-object constraints can all
be defined in terms of whatever query API we already have, which in turn
will force flushing of objects by checking their constraints.
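That is, roughly (continuing the stand-in names from the sketches above, with run_database_query() as a stub for whatever the back end actually does):

    def run_database_query(criteria):
        return []                           # stand-in for the real back end

    def query(workspace, criteria):
        # Resolving conditions for "query" is where dirty objects get
        # their constraints checked and flushed to the database.
        resolve_for(workspace, "query")
        if workspace.unresolved():
            raise UnresolvedConditions("can't query with unresolved conditions")
        return run_database_query(criteria)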
And now, here we are, having come full circle to a very simple model: if
you have a dirty object, check all constraints that are required for the
current operation to proceed.
Defining Constraints
--------------------
But is it really a simple model? Or did we just hide the complexity under
the heading of "required for the current operation"? How do we know if a
constraint is required for the current operation to proceed? We really
don't want to have to annotate every constraint with information about
that. So, we need a rule-based mechanism to decide that. Constraints on
constraints -- meta-constraints, in other words.
But that might not be as hard as it sounds. All we have to do is express
constraints as objects, and then they're just as queriable as any other set
of objects. For that matter, generic function predicates could perhaps be
applied to constraints, to examine them and determine things like whether
they involve more than one persistent object type beyond the mere
"referential integrity" level, or if they involve only one-to-one
relationships.
So, for a combination of a given changed object and the operation we want
to perform, we can identify what constraints need to be checked, and check
them, possibly creating condition objects that must then be resolved.
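For example (a sketch only; the ConstraintInfo metadata and the operation names are invented for illustration), selecting the constraints relevant to a changed object and an operation is then just a filter over constraint objects:

    class ConstraintInfo:
        def __init__(self, name, types, needs_database, check):
            self.name = name
            self.types = types                    # persistent types involved
            self.needs_database = needs_database  # must defer until flushed?
            self.check = check

    def constraints_to_check(all_constraints, changed_obj, operation):
        relevant = [c for c in all_constraints
                    if isinstance(changed_obj, tuple(c.types))]
        if operation == "flush":
            # Before writing: only constraints checkable without the DB.
            return [c for c in relevant if not c.needs_database]
        if operation == "commit":
            # After writing, before commit: the deferred, DB-assisted ones.
            return [c for c in relevant if c.needs_database]
        return relevant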
I had hoped to avoid creating a Babel of zillions of constraint types:
value constraints, inter-field constraints, referential constraints, and
query constraints. I was thinking maybe we could just define everything in
terms of queries, and thereby simplify the system and the API.
But I think that may not really be possible here. Simple value constraints
are easily checked on-the-fly, for example, but constraints of greater
complexity than "referential" or "one-to-one query" have to be deferred
when the query mechanism needs to use an external source. So, having
distinct kinds of constraint objects for these different complexity levels
is probably useful, even if we create them by parsing query expressions (in
a way similar to how the dispatch package creates generic function predicates).
So, the overall model would look something like this... constraints are
registered with a workspace class. The registration process uses a generic
function to determine how to "operationalize" the constraint. That is, how
the constraint is implemented. For example, a value constraint may be
implemented by adding a rule to the workspace's "change attribute" generic
function to perform the check.
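Roughly like so (again, a sketch with made-up names, using a plain dict where the real thing would be generic functions, and reusing the ViolationCondition stand-in from earlier):

    attribute_change_rules = {}     # (class, attribute) -> list of checks

    def operationalize(cls, attribute, constraint_name, check):
        # The real operationalizer would be a generic function choosing
        # among implementation strategies; this only handles the simple
        # value-constraint case.
        attribute_change_rules.setdefault((cls, attribute), []).append(
            (constraint_name, check))

    def change_attribute(workspace, obj, attribute, value):
        setattr(obj, attribute, value)
        for name, check in attribute_change_rules.get(
                (type(obj), attribute), ()):
            if not check(obj):
                workspace.add_condition(
                    ViolationCondition("violated: " + name))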
Thinking about this a little more, I see that this simple two-level
approach allows a lot of flexibility. The "operationalizing" generic
function basically exists to add rules to other generic functions, based on
the properties of a constraint.
The main difficulty I see with this approach is that it means that the
operationalizer has to be fully populated with its rules, before you can
define any constraints for that workspace class. In fact, it may mean that
the actual "operational" generic functions can't be initialized until a
workspace instance is created, which would suck if you throw a lot of
workspace objects around and the creation process is slow.
On the other hand, maybe all this really does is create operational
functions on a per-class basis. That is, it effectively generates
__set__/__get__/__delete__ methods for a single class' attributes, using
the constraints and other stuff as a guide. This would let the process
operate in a more lazy fashion, while having the advantage of pre-tailoring
the methods for the specific context.
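A stripped-down illustration of that per-class flavor (names invented; a real version would record a Condition instead of raising immediately):

    class ConstrainedAttribute(object):
        def __init__(self, name, checks):
            self.name = name                # attribute name
            self.checks = checks            # list of (description, predicate)

        def __get__(self, obj, objtype=None):
            if obj is None:
                return self
            return obj.__dict__[self.name]

        def __set__(self, obj, value):
            for description, predicate in self.checks:
                if not predicate(value):
                    # a real version would add a Condition to the workspace
                    raise ValueError("%s: %s" % (self.name, description))
            obj.__dict__[self.name] = value

    class Item(object):
        quantity = ConstrainedAttribute(
            "quantity", [("must be non-negative", lambda v: v >= 0)])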
Wrap-Up
-------
A finished design, this certainly isn't. I've got a lot to think about,
moving forward. The key new elements here are:
* Use "metalevel" generic functions to transform abstract constraints
into operational rules in the "implementation level" generic functions
* Even if we express constraints as query expressions at the API level,
transform them into simple object constraints wherever possible
* Track "Conditions" that need "Resolution" within the current operation.
* Don't worry about checking complex constraints before writing to the
DB; if it's too complex to do without using the DB, the DB probably isn't
smart enough to check the constraint by itself!
Actually, the third element isn't that new, except that I've formalized and
generalized it as a broad category of things I previously thought of
separately. And we've been implicitly doing the fourth one for some time,
because there really wasn't any other choice. But formalizing it is useful.
But the first one is very important, because it means that we can now say
that the interpretation of constraints can *vary according to the kind of
back-end storage*, which will give us a lot of implementation flexibility.