[PEAK] Some thoughts towards implementing peak.schema

Phillip J. Eby pje at telecommunity.com
Sun Oct 16 01:05:36 EDT 2005


Now that the setuptools 0.6 alpha series has finally landed, I've decided 
to take a little time towards making peak.schema a reality.  Note 
especially the "little" and "towards" - I don't expect any immediate results.

I have, however, hammered out a few things that are more or less solid in 
my mind, and there are some others left to work out.

First, a recap of the plan so far. peak.schema will be the successor to 
much of what now constitutes peak.model and peak.storage, taking advantage 
of "next generation" PEAK features like generic functions and dynamic 
variables.  (The "next generation" phrase being tongue-in-cheek, since Lisp 
has had both for decades.)

Instead of a data manager for each model element type, peak.schema will use 
a single logical "database" object for interacting with objects, and that 
logical database will actually be a dynamic variable, because the vast 
majority of code tends to perform operations against a single logical 
database.  (Code that needs to use more than one will of course still be 
able to do so.)

Using dynamic variables means we'll be able to have an interface comparable 
to that of SQLObject, which allows you to do queries using class methods 
without passing in an explicit database object.  An important difference is 
that you'll (eventually) define those queries using Python list or 
generator comprehensions, and retain the ability to override their mappings 
to SQL (or other back-end targets) with specific code by adding generic 
function methods.

Your model classes, however, will remain "pure" in the sense of defining 
only logical queries, not physical ones.  So this will make it easier to 
segregate mapping code for multiple backends, as when your application 
supports both sqlite and PostgreSQL, or Sybase and Oracle, or whatever.

The information model for peak.schema will have two principal kinds of 
objects: entity types and value types, similar to today's model.Element and 
model.Immutable.  Value types are immutable and can only refer to other 
value types, with no bidirectional links.  Constraints internal to value 
types are checked at creation.

Element types are mutable, and object-level constraints on them are checked 
at commit of a changed object or upon request.  Elements keep track of the 
database that "owns" them, and may only be "owned" by one database at a 
time (thereby simplifying transaction management, undo, etc.).

Both types also allow attribute-level constraints, which are basically an 
expression and a message (explaining what's wrong or what to correct).  In 
essence, they can be thought of as an assertion.  Indeed, constraints at 
all levels (attribute, object, and database) are logically just assertions, 
but need some additional metadata to allow skipping checks when the changes 
that occurred can't possibly affect the validity of something that's 
expensive to check.  This is an area that's still somewhat open to design, 
and early versions will probably just check everything you define.
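
Just to make the "expression and a message" idea concrete, here's a
minimal sketch in plain Python (none of this is actual peak.schema API,
just an illustration of the shape):

    # Illustrative only: an attribute-level constraint as an
    # (expression, message) pair, behaving like a labelled assertion.
    age_constraint = (lambda value: 0 <= value <= 150,
                      "age must be between 0 and 150")

    def check(value, constraint):
        expr, message = constraint
        if not expr(value):
            raise ValueError(message)

    check(42, age_constraint)        # passes silently
    # check(-1, age_constraint)      # would raise ValueError with the message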


Attributes
----------

In contrast to peak.model, peak.schema attributes will be defined using 
binding-like descriptors, rather than nested classes.  The nested class 
idea  was originally to support mixins via TransWarp's module inheritance 
AOP facility, and is no longer needed.  Early this year I developed a 
prototype of some of these descriptors as part of the "Spike" 
proof-of-concept for simplifying the Chandler architecture.  Since then, a 
production version of the descriptors was put into use, and Chandler 0.6 
will use them almost exclusively for schema definition.

This let me field-test and verify some of the ideas I had for schema 
description via attributes, and it seems to have worked out well.  I'll 
likely do something similar to what I implemented in Spike, which is closer 
to my original concept for peak.schema.  I do intend, however, to expand 
the total number of descriptors to five: One, Many, Sequence, Mapping, 
Graph.  (I had originally intended to do just the first four, and in Spike 
I tried to see how far I could get with just the first two.  Chandler uses 
my original four.)

Anyway, a "one" attribute is a normal attribute, "many" is a set (what you 
would usually want for a many-valued attribute), "sequence" is a list, 
"mapping" a dictionary-like object, and "graph" is a dictionary of 
sets.  In principle it could be a dictionary of lists, but that's getting 
too darn complicated for me.
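
To make those shapes concrete in plain Python terms (this is purely
illustrative; no peak.schema API is assumed):

    # Illustrative only: the container shape behind each descriptor kind.
    one      = "Joe"                           # One: a single value
    many     = {"python", "peak"}              # Many: an unordered set
    sequence = ["draft", "review", "publish"]  # Sequence: an ordered list
    mapping  = {"home": "555-1234"}            # Mapping: key -> value
    graph    = {"reads": {"b1", "b2"}}         # Graph: key -> set of values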

One reason, by the way, for distinguishing "many" and "sequence" is that 
relational databases suck at maintaining a user-defined sequence.  That is, 
if I chuck "1 9 3 27 16" into a database, I'm not guaranteed to get them 
back in that order, although I can easily get them to come back "1 3 9 16 
27" if I want.  If you need a sequence, you need an extra column to hold 
either the sequence numbers or a series of linked-list pointers.  Each 
approach has advantages and disadvantages, so the choice of the translation 
will need to be part of the database mapping.  (And I'm tempted to make it 
so that you *have* to specify it manually, to discourage unnecessary use of 
sequences where "many" will do.)
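
As a rough illustration of the two translations (plain Python data, not
any actual mapping API):

    # Illustrative only: two ways to persist the user-defined order
    # "1 9 3 27 16", shown as the extra column each row would carry.

    # Option 1: an explicit sequence-number column.  Reordering means
    # renumbering every row after the insertion point.
    with_positions = [("1", 0), ("9", 1), ("3", 2), ("27", 3), ("16", 4)]

    # Option 2: a linked-list pointer column.  Reordering touches only a
    # few rows, but reading in order means chasing the pointers.
    with_pointers = {"1": "9", "9": "3", "3": "27", "27": "16", "16": None}

    def in_order(links, first="1"):
        item = first
        while item is not None:
            yield item
            item = links[item]

    assert list(in_order(with_pointers)) == ["1", "9", "3", "27", "16"]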

The descriptors themselves are not really responsible for mapping to a 
database, of course.  They are only responsible for supplying proxy 
structures that convert to and from a fact-oriented (aka domain relational 
calculus) view of the data.  In essence, this typically amounts to flinging 
tuples around.  For example, a "One" attribute is nominally represented in 
fact orientation using a two-tuple: the object and the attribute value, and 
you can either add or remove that tuple from a fact set.  So a "One" 
descriptor will remove the old tuple and add the new one when you change a 
value.  The underlying database object and various helpers will be 
responsible for managing translation from this "elementary conceptual fact" 
format to an appropriate "compound physical fact" format -- i.e., table row 
modifications, or stores to sets and dictionaries in-memory.  :)  The 
database and its helpers will also be responsible for caching reads and/or 
aggregating writes.
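
Here's a toy mock of the "One" case, just to show the flinging of tuples
(this is not the real descriptor implementation, only an illustration):

    # Illustrative only: a per-attribute fact set, and a One-style
    # descriptor that turns assignment into removing the old
    # (object, value) tuple and adding the new one.
    class One(object):
        def __init__(self):
            self.facts = set()      # elementary facts for this attribute

        def __get__(self, obj, cls=None):
            if obj is None:
                return self
            for subject, value in self.facts:
                if subject is obj:
                    return value
            raise AttributeError("unset")

        def __set__(self, obj, value):
            self.facts = set(f for f in self.facts if f[0] is not obj)
            self.facts.add((obj, value))

    class Person(object):
        hiredate = One()

    p = Person()
    p.hiredate = "1991-09-01"
    p.hiredate = "1992-01-15"                 # replaces the old fact
    assert Person.hiredate.facts == set([(p, "1992-01-15")])
    assert p.hiredate == "1992-01-15"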

A "Many" attribute is similarly easy to convert, but requires a 
"set"-interface wrapper that converts reads and writes on the set into 
elementary fact tuple operations.  Pretty much a piece of cake.
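
A minimal sketch of such a wrapper (again, illustrative only):

    # Illustrative only: a set-like proxy that turns membership
    # operations into elementary fact tuple operations.
    class ManySetProxy(object):
        def __init__(self, facts, obj):
            self.facts, self.obj = facts, obj
        def add(self, value):
            self.facts.add((self.obj, value))
        def remove(self, value):
            self.facts.remove((self.obj, value))
        def __iter__(self):
            return (v for o, v in self.facts if o is self.obj)
        def __contains__(self, value):
            return (self.obj, value) in self.facts

    facts = set()
    joe = object()
    tags = ManySetProxy(facts, joe)
    tags.add("python"); tags.add("peak")
    assert sorted(tags) == ["peak", "python"]
    assert facts == set([(joe, "python"), (joe, "peak")])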

Sequence, Mapping, and Graph are all more interesting because they involve 
at least "ternary facts" - i.e. 3-tuples or higher.  Depending on the 
mapping, a sequence might be implemented using sequence numbers or a linked 
list.  Both mappings suck, for different reasons.  If you actually use 
sequential sequence numbers, you get cheap indexed sequential access, but 
you incur O(N) renumbering cost every time you do something other than add 
to the end.  Linked lists are pretty cheap to re-order, but sequential 
access is O(M log N), where M is the number of items retrieved (versus 
O(log N) + M with sequence numbers).  If properly indexed, you can jump to the start or 
end of a linked list pretty cheaply and iterate from there, but paging is hard.

Not that paging in general is easy.  Of course, a key factor to keep in 
mind about these kinds of sequences is that if there's any reorganization, 
it's normally done by a human.  So there are some natural limits to how big 
the list can get and how far up or down a person is going to drag 
something.  Nonetheless, it's best to consider these issues before choosing 
to model something as a sequence instead of a set.

Mappings and Graphs are different from Sequence in that they directly 
expose the third part of the ternary fact - the "key".  I still haven't 
gone through all the possibilities as far as combining these with other 
descriptors at the "opposite end".  It seems to me that in principle you 
should be able to do things like have an object with a "name" attribute 
that actually corresponds to the key in a Mapping attribute on some other 
object.  Certainly you can model that concept in a fact-oriented schema, 
but I haven't yet figured out how you would "spell" that in the object 
version of the schema.

All of the attribute types except "One" will do most of their work via some 
type of proxy container object, simulating a set, list, dictionary, or 
dictionary of sets.

In principle, it should be possible to define other kinds of attributes 
besides these few essentials.  An attribute descriptor's main 
responsibilities (sketched roughly in code after this list) are:

* Create the fact-level schema for itself, and transform model-defined 
constraints to fact-level constraints, adding them to the fact-level schema

* Transform reads and writes to what's needed for the fact layer (possibly 
via a helper proxy)

* Check attribute-level constraints at assignment time (when used in a 
Value type; the fact layer is responsible for this when using Entity types)
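
Pictured as code, those responsibilities might amount to an interface
roughly like this (the names here are made up purely for illustration):

    # Hypothetical shape of an attribute descriptor; not the real API.
    from abc import ABC, abstractmethod

    class AttributeDescriptor(ABC):

        @abstractmethod
        def fact_schema(self, owner_type):
            """Build the fact-level schema and constraints for this attribute."""

        @abstractmethod
        def proxy_for(self, instance, database):
            """Return a helper proxy translating reads/writes into fact operations."""

        @abstractmethod
        def check(self, value):
            """Check attribute-level constraints at assignment (Value types)."""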

As you might guess, this probably means that peak.schema Entity objects 
will be somewhat slower to use than Value objects, since reads and writes 
will be passed through to the fact layer, at least in early 
versions.  We'll experiment with caches later, once the semantics are 
sufficiently settled.

Anyway, the general idea should be that if you can come up with a 
descriptor for some specialized kind of feature mapping (e.g. transitive 
relationships, symmetric relationships, or other database exotica), you 
should be able to do it by having your descriptor translate to a 
fact-oriented schema, and by using an appropriate helper proxy.

The helper proxies will be created by a generic function (of course) that 
has access to the descriptor, the fact schema, and the target 
database.  Proxies can and do get cached in a given Entity instance, so the 
creation (and rule overhead) should only occur on first access.  But the 
use of a generic function means you'll have the opportunity to tweak the 
proxy type selection (or instance configuration) for a specialized 
situation if you need to.  It also means that the development path of the 
core descriptors should be relatively smooth for me; I can do most of the 
core development with One, Many, and Mapping, without needing any 
restructuring to implement the rest.
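
In case the shape of that isn't obvious, here's a toy stand-in for the
proxy-selection rule (a plain registry rather than a real generic
function, and ignoring caching):

    # Illustrative only: pick a helper proxy factory based on the
    # descriptor type and the database type.  The real version would be
    # a generic function and would also consult the fact schema.
    _proxy_rules = []

    def when(descriptor_type, db_type):
        def register(factory):
            _proxy_rules.append((descriptor_type, db_type, factory))
            return factory
        return register

    def proxy_for(descriptor, fact_schema, db):
        for d_type, db_type, factory in _proxy_rules:
            if isinstance(descriptor, d_type) and isinstance(db, db_type):
                return factory(descriptor, fact_schema, db)
        raise LookupError("no proxy rule for this combination")

    class Many(object): pass
    class MemoryDB(object): pass

    @when(Many, MemoryDB)
    def set_proxy(descriptor, fact_schema, db):
        return set()          # stand-in for a real set-like proxy

    assert proxy_for(Many(), None, MemoryDB()) == set()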


Sets and Constraints
--------------------

Sets are at the heart of peak.schema.  Types are sets.  Constraints are 
sets (of valid objects).  Even attributes are sets!  (The set of values of 
the attribute over all objects of that type.)

Constraints are applied to sets.  You can say that one set must be a subset 
of another, for example.  Some of the uses of this include saying that an 
attribute must hold a certain type (i.e., the set of values of this 
attribute must be a subset of the values of the type).

There are in fact only three fundamental kinds of constraints in 
peak.schema (and in the modelling theories that it's based on):

* subset relationships (subset, equivalence, and mutual exclusion)
* cardinality constraints (e.g. uniqueness)
* "ring" constraints (acyclicity, intransitivity, irreflexivity, etc.)

At the object level of the schema, however, these are more diverse in 
nature.  Most of the things that commonly get thought of as constraints or 
type information in other schema systems (like zope.schema and FormEncode's 
validators) are subset constraints.  Or more precisely, those kinds of 
constraints define sets (possibly infinite ones) that you then want 
attributes to be a subset of.  Even required-ness or nullability of an 
attribute can be expressed in terms of subset or set equivalence 
relationships, by saying that the set of objects having the attribute is a 
subset of (or an equivalent set to) the set of objects of that type.  (Thus 
making it optional or mandatory.)

But at the fact-oriented layer of the schema, all these constraints will 
just be reflected as "X is a subset of Y" or "X is the same set as Y", 
without really caring about what set Y represents or how it's 
implemented.  It'll simply attempt to validate that constraint when adding 
or removing values in set X, and fancier fact-to-relational mappings might 
also apply the constraints to the DDL as well with CHECK clauses.  (Or, if 
the target set is something stored in the database, an appropriate foreign 
key constraint should get created.)
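
Conceptually, the fact-layer check is not much more than this sketch
(illustrative only; the target set Y is reduced to a membership test):

    # Illustrative only: enforce "X is a subset of Y" when a value is
    # added to X, without caring how Y is represented or implemented.
    def check_subset(value, in_target_set, message="value not allowed"):
        if not in_target_set(value):
            raise ValueError(message)

    is_positive_int = lambda v: isinstance(v, int) and v > 0

    check_subset(5, is_positive_int)            # fine
    try:
        check_subset(-3, is_positive_int, "must be a positive integer")
    except ValueError as e:
        assert str(e) == "must be a positive integer"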

Cardinality constraints are also familiar DB concepts - keys and 
indexes.  Technically, index information isn't about cardinality per se, 
but I think it'll be simplest to express index information and optimizing 
hints via the same sort of structure as declaring cardinality or uniqueness 
constraints.

Finally, ring constraints are kind of a specialty thing.  I'd like to make 
sure the infrastructure can support things like those, but I don't really 
plan to do a full implementation of all the ORM-style ring constraints, 
certainly not in the first version.  :)  It's more that I want to make sure 
that the infrastructure can deal effectively with exotic constraints.

The object layer will provide lots of syntactic sugar for constraints, so 
that common subset constraints can be specified compactly.  For example, 
you should be able to just specify a type or a validator of some kind and 
have it just "know" you mean to make an attribute a subset of 
that.   Specifying more exotic constraints like mutual exclusion or set 
equivalence should require some sort of wrapper, like 'exclude(...)'  or 
'exactly(...)' or something.  Anyway, in the simple case:

     foo = schema.One(Bar)

should just know you mean to make the 'foo' attribute accept 'Bar' 
instances.  More elaborately:

     foo = schema.One(Bar, someValidator, otherConstraint)

would apply the other constraints to the attribute as well.


Queries
-------

A couple of years ago, I tried to create a peak.query library to implement 
conceptual queries against arbitrary database back-ends.  Unfortunately, I 
only knew just enough about the field to be dangerous.  :)  I grokked the 
domain relational calculus, tuple relational calculus, relational algebra, 
some ORM query stuff, and I even figured out that Python listcomps are 
isomorphic to the domain relational calculus, making them an ideal syntax 
for expressing queries in Python.  But I was a bit stumped as to how to 
make it all work, between objects and a relational DB, or even multiple 
DBs.  It was just too complex.

I did, however, order the ORM text, "Information Modeling and Relational 
Databases", a huge tome on the basis of conceptual modeling and elementary 
fact orientation.  And apparently I've beaten myself over the head with it 
long enough that some of it has sunk in, because when I mentally picked up 
where I left off on peak.query, I found that my brain had filled in a 
surprising amount of the gaps in what I did previously.

Partly, this is because modeling based on elementary facts is much simpler 
than either the relational or object model.  Elementary facts are simple 
ideas like "The person with name 'Joe' was hired on '9/1/1991'", usually 
representable as 2-tuples or 3-tuples.  So, to evaluate a query like:

     [(p.birthdate, p.hiredate) for p in Person if p.name=='Joe']

You simply convert it to the domain relational calculus as follows:

    v1,v2: person_birthdate(v3,v1) and person_hired(v3,v2) and v3=='Joe'

which then simplifies to:

    v1,v2: person_birthdate('Joe',v1) and person_hired('Joe',v2)

And then transform to the tuple relational calculus, assuming the 
person_birthdate and person_hired facts map to an 'employees' table:

    v1, v2: employees('Joe', v1, v2, ...wildcards for other columns)

And this then trivially maps to SQL:

    select e1.birthdate, e1.hire_date from employees e1 where name='Joe'
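
The simplification step above is just constant substitution; as a tiny
illustration (representing each fact as a name plus an argument tuple,
purely for illustration rather than as a committed internal form):

    # Illustrative only: bind v3 to 'Joe' across the remaining facts
    # after dropping the equality test.
    def substitute(predicates, var, constant):
        return [(name, tuple(constant if a == var else a for a in args))
                for name, args in predicates]

    query = [("person_birthdate", ("v3", "v1")),
             ("person_hired",     ("v3", "v2")),
             ("eq",               ("v3", "Joe"))]

    bound = substitute([p for p in query if p[0] != "eq"], "v3", "Joe")
    assert bound == [("person_birthdate", ("Joe", "v1")),
                     ("person_hired",     ("Joe", "v2"))]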

But the real advantage comes in with stuff like:

    [(p.name, p.spouse.name) for p in Person if p.spouse.hiredate < p.hiredate]

which becomes a DRC query of:

     v1,v2: person_spouse(v1,v2) and person_hiredate(v1,v4) and
            person_hiredate(v2,v5) and lt(v5,v4)

and then a TRC query of:

     v1, v2: employees(v1,...,hiredate=v4,spouse=v2) and
             employees(v2,...,hiredate=v5) and lt(v5,v4)

resulting in SQL of:

     SELECT e1.name, e1.spouse
       FROM employees e1, employees e2
      WHERE e2.name=e1.spouse AND e2.hiredate<e1.hiredate

In other words, arbitrary joins, even cartesian joins, made fresh to order.

Now, none of these translation stages are *trivial*, mind you.  They 
absolutely depend on knowing things like what type 'p.spouse' will refer 
to, and so on.  But the basic process should also be extensible to multiple 
databases or even no database, in the sense that pure in-memory data 
structures should be just as queriable.

One of the things that made peak.query seem farfetched at its inception was 
that I hadn't implemented generic functions yet, so I wasn't entirely 
certain that I could implement a Python expression parser of this 
complexity.  Now, I've already got an implementation that works nicely for 
everything but the listcomp structure itself.  It should be a breeze to 
update it, and I also now have experience doing the sort of abstract 
expression interpretation and constant folding that the generic function 
stuff does, and that this system will need.

Also, as it happens, the actual implementation of a system like this really 
wants to use generic functions for at least the SQL generation part.  Lots 
of silly little details vary between SQL implementations in ways that 
really want to be rule-based.

The middle layer, DRC->TRC conversion and tuple merging, is pretty simple, 
as the elementary fact types are basically views onto tables.  So, the 
conversion basically consists of replacing the elementary fact types with 
the tables they're views of, and filling in wildcards for the empty 
slots.  You then go through and unify any predicates that are against the 
same table with matching variables in unique columns (or sets of unique 
columns), thereby reducing to simple TRC.
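
A toy version of that unification step, using the earlier 'Joe' example
(the row-pattern representation here is just for illustration, not a
committed internal form):

    # Illustrative only: expand two elementary facts into row patterns
    # over the employees table (name, birthdate, hiredate), with
    # wildcards for the unknown columns, then merge them on the key.
    WILD = "*"

    def as_row(fact, width, value_column):
        row = [WILD] * width
        row[0] = fact[0]                 # key (name) assumed in column 0
        row[value_column] = fact[1]
        return row

    def unify(row_a, row_b):
        merged = []
        for a, b in zip(row_a, row_b):
            if a == WILD:
                merged.append(b)
            elif b == WILD or a == b:
                merged.append(a)
            else:
                return None              # conflicting values: no merge
        return merged

    row1 = as_row(("Joe", "v1"), 3, 1)   # person_birthdate('Joe', v1)
    row2 = as_row(("Joe", "v2"), 3, 2)   # person_hired('Joe', v2)
    assert unify(row1, row2) == ["Joe", "v1", "v2"]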

For actual query generation, you can support multiple back-ends by having 
each back end pull out the parts of the TRC that relate to it, replacing 
them with pseudo-tables representing the portion of the query they can 
handle.  When all backends have done this, the remaining TRC predicates 
have to be joined by the engine itself, to do any cross-database joins.  If 
there's just one predicate, the engine just iterates over it and yields the 
results.

Of course, there are many devils in the details!  In this simple overview, 
I've avoided things like 'OR', outer joins, aggregate functions and the 
like.  Some of these things will be harder to implement than others, mainly 
because I need to work out how to express them cleanly in my concept of how 
the DRC/TRC intermediate representations will work, especially if they're 
to map cleanly to SQL.  The grammar isn't an issue, because things like 
this are natural in Python:

    [(d.name, max(p.salary for p in d.employees)) for d in Department]

Thus, the only sticking point is how to get from there to the SQL that may 
need to use either subqueries or GROUP BY, and handling cases where the 
backend doesn't support one of the needed features.

But, once these issues are resolved, peak.query will end up being the 
easiest to use, most natural syntax possible for expressing database 
queries, against zero or more physical back ends.  Coupled with the other 
features I've described, peak.schema should be able to be the premier 
object-relational library -- not just for Python, but for any programming 
language, hands down.

Any comments or questions so far?  I haven't covered issues of caching or 
transactions to any significant extent yet, mainly because that's (mostly) 
beneath the fact layer, except insofar as transaction boundaries are likely 
the appropriate time for certain kinds of constraint checking.



