[PEAK] Constraints as Condition/Resolution pairs
Phillip J. Eby
pje at telecommunity.com
Sat Nov 27 17:58:31 EST 2004
What's in a Constraint?
-----------------------
What's the difference between a peak.model object and a normal Python
object? Only two things, really: metadata and constraints. (Persistence
is actually a special case of constraints, as we'll see later.)
Now that we have a simple mechanism for defining metadata in orthogonal
ways, however, there's not much direct need for peak.model-style metadata
any more. That leaves constraints.
In its truest essence, a constraint is just a description of a desirable or
undesirable state of some collection of objects. A state or condition that
we would like to maintain -- or avoid.
In peak.model, some of these constraints are maintained
automatically. Specifically, those constraints that involve bidirectional
associations, multiplicity of collections, persistence, and so on. Other
constraints have to be implemented in code, often in a very complex
fashion. It's also difficult to implement delayed constraints of any sort,
as the basic model is, "raise an exception and reject the operation if a
constraint is violated."
So, I've been thinking about a new approach to constraints for peak.schema.
If we return to that essence of a constraint, we see that the constraint is
really just a kind of query; a query that matches either "good" or "bad"
states. So, in theory, we could just run queries for "bad" states (by
inverting "good" queries, if necessary), and then report on the results or
take appropriate action.
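Just to make that concrete, here's a tiny sketch in plain Python (none of this is peak.schema API; the names are invented for illustration). A constraint is written as a predicate describing the "good" state, and the "bad"-state query is simply its inversion applied to whatever collection of objects is in scope:

    class Account:
        def __init__(self, name, balance):
            self.name = name
            self.balance = balance

    def balance_is_valid(account):          # the "good" state
        return account.balance >= 0

    def find_violations(objects, good):     # query for the "bad" states
        return [ob for ob in objects if not good(ob)]

    accounts = [Account("a", 10), Account("b", -3)]
    assert [a.name for a in find_violations(accounts, balance_is_valid)] == ["b"]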
Of course, as a practical matter, it's silly to query an entire database
for invalid items when you can prevent them from becoming invalid. So, one
could limit the queries to the objects that are being changed within the
current transaction or subtransaction. For that matter, one could limit
the queries so that they only involve *attributes* that have been changed.
As we get more and more precise regarding what queries should actually be
run, we increase the complexity of the system in an attempt to gain
performance. The question is, is it premature (or at any rate, excessive)
optimization to get so picky?
In a client-server (e.g. web) application, the typical transaction is
either creating or updating an object using a full set of fields supplied
by the user. Every field has to be checked anyway, as does almost every
condition that crosses object boundaries. So, if we perform all the checks
at once in order to validate the input, we won't waste time trying to keep
track of which fields we changed and which ones we didn't.
In a GUI application, on the other hand, we have cycles to spare while we
wait for the slow human to type and move the mouse around -- not to mention
while they're busy interacting with other humans. We can just as easily
check all the fields whenever even one of them is changed. Granted, we
don't necessarily want to check inter-object constraints constantly,
especially if they involve database queries.
So, one thing we could do is look at queries in a hierarchy, or divide
constraints into various degrees of immediacy of checking. That might help
with the issue, but it takes us right back to making the application more
complex in order to squeeze out some performance.
Condition/Resolution
--------------------
Let's look at this another way. Assume that a system of objects is
initially in a state that satisfies all constraints. We can think of this
as the system being "in balance", like in double-entry bookkeeping. Any
action we take may potentially put the system "out of balance", requiring
some opposing action to bring the system back to its "correct" state. The
opposing action could be another change, or it could be a reversal of the
first change. It could even be an action like notifying someone or
recording the discrepancy in the database for later resolution. The point
is, once the system is out of balance, it must be rebalanced in order to
commit a transaction or subtransaction.
I call this concept, "Condition/Resolution". When you change an object,
conceptually this may give rise to a Condition indicating that one of its
constraints is violated. Or, maybe you make a change that causes an
existing Condition to be resolved. For example, consider a constraint that
says an inventory item's sale date must be after its purchase date. If you
are changing both, you might temporarily violate the constraint when you
change one field, then correct it when you change the second field.
So, any change must (conceptually) check whether the before state and the
after state give rise to a Condition, or the Resolution of that
condition. If any unresolved Conditions exist at commit time, the
transaction or subtransaction simply refuses to commit. (Of course, you
*could* transfer a Condition to the parent transaction of a subtransaction,
but that would technically be a Resolution of that condition, at least
within the subtransaction's scope.)
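In rough pseudo-API form (the Condition and Workspace names here are just illustrative stand-ins, not a finished design), the mechanism might look something like:

    class Condition:
        def __init__(self, description):
            self.description = description
            self.resolved = False

        def resolve(self):
            self.resolved = True

    class UnresolvedConditions(Exception):
        pass

    class Workspace:
        def __init__(self):
            self.conditions = []

        def add_condition(self, condition):
            self.conditions.append(condition)

        def unresolved(self):
            return [c for c in self.conditions if not c.resolved]

        def commit(self):
            pending = self.unresolved()
            if pending:
                raise UnresolvedConditions(
                    [c.description for c in pending])
            # ...otherwise, actually flush and commit here...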
When I first thought of this idea, I was thinking that you would actually
literally check constraints on every change, and create Condition objects
for each violated constraint. However, there is actually an even simpler
way to do it, which is just to generate Conditions meaning, "the state of
such-and-such a thing has not been checked". When it's time to resolve
conditions, these conditions can "self-resolve", by carrying out the
necessary check. If the check(s) pass, the condition is resolved. If they
fail, the condition generates new conditions describing the
failures. Either way, the original condition is now resolved.
Conditions representing constraint violations, of course, can't be
self-resolved in the general case. They simply accumulate in the
(sub)transaction and prevent it from committing. Application code must
iterate over the Conditions that can't self-resolve, and do something with
them.
Self-resolve is actually a bit of a misnomer, since the same constraint may
have different kinds of resolution required, depending on the
circumstance. So, in reality, there will probably be some sort of
'resolveCondition()' generic function associated with the workspace, to
allow context-specific resolution policies to be defined and automatically
applied for specific kinds of conditions. One of the default policies
would of course be, "resolve unchecked-constraint conditions by checking
the constraint(s) involved".
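As a sketch of such a policy (reusing the Condition and Workspace stand-ins from the sketch above, and with functools.singledispatch filling in for the dispatch package's predicate-based generic functions, so it only dispatches on condition type):

    from functools import singledispatch

    class UncheckedCondition(Condition):    # "so-and-so hasn't been checked"
        def __init__(self, target, checks):
            Condition.__init__(self, "unchecked: %r" % (target,))
            self.target = target
            self.checks = checks            # list of (name, predicate) pairs

    class ViolationCondition(Condition):    # "constraint so-and-so is violated"
        pass

    @singledispatch
    def resolveCondition(condition, workspace):
        pass    # default: leave it for application code to deal with

    @resolveCondition.register(UncheckedCondition)
    def _(condition, workspace):
        # Default policy: resolve "unchecked" conditions by running the
        # checks; failures spawn ViolationConditions, but the original
        # condition is resolved either way.
        for name, check in condition.checks:
            if not check(condition.target):
                workspace.add_condition(
                    ViolationCondition("violated: " + name))
        condition.resolve()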
Database Interaction
--------------------
Another kind of Condition that's relevant here is "this object has been
changed in memory but not written to the database". Or, "this object has
changed in a way that requires notifying a manager, but no email has been
sent yet". These are also constraints, even though they aren't the sort of
thing that initially comes to mind as a constraint. (Remember how I said
persistence was a special case of constraint maintenance?)
When we expand the scope in this way, however, it becomes clear that
transaction commit isn't the only action that may require conditions to be
resolved first. Running a database query while there are dirty objects in
memory is a no-no, so effectively the query needs to ask the workspace to
resolve such conditions. The actual writing of data to the database,
however, first requires any data-integrity conditions to be resolved.
This implies that there must be a way to query the workspace for active
conditions of some kind, or at least a way to check whether a particular
active condition affects the operation you are trying to perform.
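Something along these lines, perhaps (again, just a sketch building on the stand-in names above):

    def affects(condition, operation):
        # Placeholder policy; a real version would dispatch on both the
        # kind of condition and the kind of operation.
        return isinstance(condition, UncheckedCondition)

    def resolve_for(workspace, operation):
        # Ask the workspace to resolve whatever active conditions would
        # interfere with the operation we're about to perform.
        for condition in list(workspace.unresolved()):
            if affects(condition, operation):
                resolveCondition(condition, workspace)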
Hm. This seems to be one of those areas where the complexity just moves
around when you push on it, instead of getting any simpler. Let me restate
again.
In an ideal world, we could simply check all constraints, all the time, and
not worry about whether we were checking things we didn't change. In a
practical world, we need to avoid checking things that didn't change,
because we often can't afford to. (E.g. there are millions of unchanged
rows we'd be checking.)
The second part of the problem is that as soon as we need to keep track of
which things we changed, we can no longer rely on queries as a mechanism to
implement constraints. Certainly, we can't use database query facilities,
because we would have had to first write the objects to the database, which
we don't want to do until after we've checked the constraints!
So, the essential problem is that we must effectively perform queries that
may bridge persistent and non-persistent state. This is especially messy
with regard to cross-object constraints. Consider, for example, objects of
type A and B, with a constraint between them. We may have modified some
objects of each type, some related and some not. We cannot simply perform
a single query, but instead are forced to check the constraint from the
perspective of each individual A or B instance we have modified.
And this is where the complexity begins. To do this, we must know each
cross-object constraint that may affect objects of a particular type, and
we must know how to rewrite the constraint query such that it is relative
to that type. For example, if every A must have at least one B such that
B.x > A.x, then when changing some B's x we must check that the
corresponding A still has at least one B whose x exceeds A.x.
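In other words, something like this purely illustrative rewrite (the A and B classes and the 'bs' collection are made up for the example):

    class A:
        def __init__(self, x):
            self.x = x
            self.bs = []                    # related B instances

    class B:
        def __init__(self, a, x):
            self.a = a
            self.x = x
            a.bs.append(self)

    def check_after_changing_b(b):
        # From B's perspective: the corresponding A must still have at
        # least one related B whose x exceeds A.x.
        return any(other.x > b.a.x for other in b.a.bs)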
But constraint rewriting is just the beginning of our woes here, because we
also have to deal with the fact that we can't just query the database,
since items might be modified in memory but not updated in the database --
because we have to check the constraints first. For one-to-one
relationships, we could just reference objects directly, because our cache
mechanisms will trivially retrieve our in-memory versions of the related
objects. But, for one-to-many relationships, we will often need the
assistance of the database to do the "heavy lifting" of a query.
So, there are two forms of complexity here: 1) constraint rewriting, and 2)
N-way queries that must bridge persistent state and "dirty" state, where N
is too large to be practically loaded into memory.
Luckily, the latter is more of a problem in theory than in practice -- we
just have to defer such queries until the dirty data has been written to
the database, but before it is committed. The simple fact is that many
relational databases don't even *attempt* to solve the issue of such
complex constraints... so the database isn't going to reject our writing
incorrect data. Indeed, I believe it's not all that common for databases
to be actually defined with inter-*column* constraints, let alone
inter-*table* constraints (apart from referential integrity restrictions).
So, in essence, any constraint that is too complicated for us to handle in
Python is almost certainly going to be more complicated than the database
can handle, so we should just write to the database first, and ask
questions later. This does lead to some mild complications when working in
a GUI environment, to ensure that the database state gets rolled back once
all the constraints have been checked. But it's a lot simpler than the
alternative.
The nice thing about this is that it means inter-object constraints can all
be defined in terms of whatever query API we already have, which in turn
will force flushing of objects by checking their constraints.
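That is, roughly (continuing the stand-in names from the sketches above, with run_database_query() as a stub for whatever the back end actually does):

    def run_database_query(criteria):
        return []                           # stand-in for the real back end

    def query(workspace, criteria):
        # Resolving conditions for "query" is where dirty objects get
        # their constraints checked and flushed to the database.
        resolve_for(workspace, "query")
        if workspace.unresolved():
            raise UnresolvedConditions("can't query with unresolved conditions")
        return run_database_query(criteria)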
And now, here we are, having come full circle to a very simple model: if
you have a dirty object, check all constraints that are required for the
current operation to proceed.
Defining Constraints
--------------------
But is it really a simple model? Or did we just hide the complexity under
the heading of "required for the current operation"? How do we know if a
constraint is required for the current operation to proceed? We really
don't want to have to annotate every constraint with information about
that. So, we need a rule-based mechanism to decide that. Constraints on
constraints -- meta-constraints, in other words.
But that might not be as hard as it sounds. All we have to do is express
constraints as objects, and then they're just as queriable as any other set
of objects. For that matter, generic function predicates could perhaps be
applied to constraints, to examine them and determine things like whether
they involve more than one persistent object type beyond the mere
"referential integrity" level, or if they involve only one-to-one
relationships.
So, for a combination of a given changed object and the operation we want
to perform, we can identify what constraints need to be checked, and check
them, possibly creating condition objects that must then be resolved.
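For example (a sketch only; the ConstraintInfo metadata and the operation names are invented for illustration), selecting the constraints relevant to a changed object and an operation is then just a filter over constraint objects:

    class ConstraintInfo:
        def __init__(self, name, types, needs_database, check):
            self.name = name
            self.types = types                    # persistent types involved
            self.needs_database = needs_database  # must defer until flushed?
            self.check = check

    def constraints_to_check(all_constraints, changed_obj, operation):
        relevant = [c for c in all_constraints
                    if isinstance(changed_obj, tuple(c.types))]
        if operation == "flush":
            # Before writing: only constraints checkable without the DB.
            return [c for c in relevant if not c.needs_database]
        if operation == "commit":
            # After writing, before commit: the deferred, DB-assisted ones.
            return [c for c in relevant if c.needs_database]
        return relevant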
I had hoped to avoid creating a Babel of zillions of constraint types:
value constraints, inter-field constraints, referential constraints, and
query constraints. I was thinking maybe we could just define everything in
terms of queries, and thereby simplify the system and the API.
But I think that may not really be possible here. Simple value constraints
are easily checked on-the-fly, for example, but constraints of greater
complexity than "referential" or "one-to-one query" have to be deferred
when the query mechanism needs to use an external source. So, having
distinct kinds of constraint objects for these different complexity levels
is probably useful, even if we create them by parsing query expressions (in
a way similar to how the dispatch package creates generic function predicates).
So, the overall model would look something like this... constraints are
registered with a workspace class. The registration process uses a generic
function to determine how to "operationalize" the constraint. That is, how
the constraint is implemented. For example, a value constraint may be
implemented by adding a rule to the workspace's "change attribute" generic
function to perform the check.
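Roughly like so (again, a sketch with made-up names, using a plain dict where the real thing would be generic functions, and reusing the ViolationCondition stand-in from earlier):

    attribute_change_rules = {}     # (class, attribute) -> list of checks

    def operationalize(cls, attribute, constraint_name, check):
        # The real operationalizer would be a generic function choosing
        # among implementation strategies; this only handles the simple
        # value-constraint case.
        attribute_change_rules.setdefault((cls, attribute), []).append(
            (constraint_name, check))

    def change_attribute(workspace, obj, attribute, value):
        setattr(obj, attribute, value)
        for name, check in attribute_change_rules.get(
                (type(obj), attribute), ()):
            if not check(obj):
                workspace.add_condition(
                    ViolationCondition("violated: " + name))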
Thinking about this a little more, I see that this simple two-level
approach allows a lot of flexibility. The "operationalizing" generic
function basically exists to add rules to other generic functions, based on
the properties of a constraint.
The main difficulty I see with this approach is that it means that the
operationalizer has to be fully populated with its rules, before you can
define any constraints for that workspace class. In fact, it may mean that
the actual "operational" generic functions can't be initialized until a
workspace instance is created, which would suck if you throw a lot of
workspace objects around and the creation process is slow.
On the other hand, maybe all this really does is create operational
functions on a per-class basis. That is, it effectively generates
__set__/__get__/__delete__ methods for a single class' attributes, using
the constraints and other stuff as a guide. This would let the process
operate in a more lazy fashion, while having the advantage of pre-tailoring
the methods for the specific context.
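A stripped-down illustration of that per-class flavor (names invented; a real version would record a Condition instead of raising immediately):

    class ConstrainedAttribute(object):
        def __init__(self, name, checks):
            self.name = name                # attribute name
            self.checks = checks            # list of (description, predicate)

        def __get__(self, obj, objtype=None):
            if obj is None:
                return self
            return obj.__dict__[self.name]

        def __set__(self, obj, value):
            for description, predicate in self.checks:
                if not predicate(value):
                    # a real version would add a Condition to the workspace
                    raise ValueError("%s: %s" % (self.name, description))
            obj.__dict__[self.name] = value

    class Item(object):
        quantity = ConstrainedAttribute(
            "quantity", [("must be non-negative", lambda v: v >= 0)])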
Wrap-Up
-------
A finished design, this certainly isn't. I've got a lot to think about,
moving forward. The key new elements here are:
* Use "metalevel" generic functions to transform abstract constraints
into operational rules in the "implementation level" generic functions
* Even if we express constraints as query expressions at the API level,
transform them into simple object constraints wherever possible
* Track "Conditions" that need "Resolution" within the current operation.
* Don't worry about checking complex constraints before writing to the
DB; if it's too complex to do without using the DB, the DB probably isn't
smart enough to check the constraint by itself!
Actually, the third element isn't that new, except that I've formalized and
generalized it as a broad category of things I previously thought of
separately. And we've been implicitly doing the fourth one for some time,
because there really wasn't any other choice. But formalizing it is useful.
But the first one is very important, because it means that we can now say
that the interpretation of constraints can *vary according to the kind of
back-end storage*, which will give us a lot of implementation flexibility.