[PEAK] Some thoughts towards implementing peak.schema
Phillip J. Eby
pje at telecommunity.com
Sun Oct 16 01:05:36 EDT 2005
Now that the setuptools 0.6 alpha series has finally landed, I've decided
to take a little time towards making peak.schema a reality. Note
especially the "little" and "towards" - I don't expect any immediate results.
I have, however, hammered out a few things that are more or less solid in
my mind, and there are some others left to work out.
First, a recap of the plan so far. peak.schema will be the successor to
much of what now constitutes peak.model and peak.storage, taking advantage
of "next generation" PEAK features like generic functions and dynamic
variables. (The "next generation" phrase being tongue-in-cheek, since Lisp
has had both for decades.)
Instead of a data manager for each model element type, peak.schema will use
a single logical "database" object for interacting with objects, and that
logical database will actually be a dynamic variable, because the vast
majority of code tends to perform operations against a single logical
database. (Code that needs to use more than one will of course still be
able to do so.)
Using dynamic variables means we'll be able to have an interface comparable
to that of SQLObject, which allows you to do queries using class methods
without passing in an explicit database object. An important difference is
that you'll (eventually) define those queries using Python list or
generator comprehensions, and retain the ability to override their mappings
to SQL (or other back-end targets) with specific code by adding generic
function methods.
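To make the dynamic-variable idea concrete, here's a minimal sketch that uses
Python's contextvars as a stand-in for PEAK's own dynamic variable machinery;
the Database class and the using() and find() helpers are invented for
illustration, not part of any planned API:

    # Minimal sketch of a dynamically scoped "current database"; contextvars
    # is only a stand-in here, and Database, using(), and find() are invented.
    from contextlib import contextmanager
    from contextvars import ContextVar

    _current_db = ContextVar("current_db")       # the "dynamic variable"

    class Database:
        def __init__(self, name):
            self.name = name
        def query(self, cls, **criteria):
            print("querying", cls.__name__, criteria, "against", self.name)
            return []

    @contextmanager
    def using(db):
        """Run a block of code against a particular logical database."""
        token = _current_db.set(db)
        try:
            yield db
        finally:
            _current_db.reset(token)

    class Person:
        @classmethod
        def find(cls, **criteria):
            # class-method query that implicitly uses the current database
            return _current_db.get().query(cls, **criteria)

    with using(Database("main")):
        Person.find(name="Joe")                  # no explicit db object passed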
Your model classes, however, will remain "pure" in the sense of defining
only logical queries, not physical ones. So this will make it easier to
segregate mapping code for multiple backends, as when your application
supports both sqlite and PostgreSQL, or Sybase and Oracle, or whatever.
The information model for peak.schema will have two principal kinds of
objects: entity types and value types, similar to today's model.Element and
model.Immutable. Value types are immutable and can only refer to other
value types, with no bidirectional links. Constraints internal to value
types are checked at creation.
Element types are mutable, and object-level constraints on them are checked
at commit of a changed object or upon request. Elements keep track of the
database that "owns" them, and may only be "owned" by one database at a
time (thereby simplifying transaction management, undo, etc.).
Both types also allow attribute-level constraints, which are basically an
expression and a message (explaining what's wrong or what to correct). In
essence, they can be thought of as an assertion. Indeed, constraints at
all levels (attribute, object, and database) are logically just assertions,
but need some additional metadata to allow skipping checks when the changes
that occurred can't possibly affect the validity of something that's
expensive to check. This is an area that's still somewhat open to design,
and early versions will probably just check everything you define.
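As a rough illustration of the intended difference in when checks fire (using
invented names throughout -- Money, Invoice, and ToyDatabase are not planned
APIs), a value type might validate in its constructor while an entity type
defers its object-level check to commit:

    # Sketch only: a value type validates at creation time, while an entity
    # type defers its object-level check to commit.
    class ConstraintError(Exception):
        pass

    class Money:                             # value type: immutable, checked up front
        def __init__(self, amount, currency):
            if amount < 0:
                raise ConstraintError("amount must be non-negative")
            self.amount, self.currency = amount, currency

    class ToyDatabase:
        def __init__(self):
            self.changed = set()
        def commit(self):
            for obj in self.changed:         # object-level constraints run here
                obj.check()
            self.changed.clear()

    class Invoice:                           # entity type: mutable, owned by one db
        def __init__(self, db):
            self.db = db
            self.lines = []
            db.changed.add(self)
        def check(self):
            if not self.lines:
                raise ConstraintError("an invoice needs at least one line")

    db = ToyDatabase()
    inv = Invoice(db)
    inv.lines.append(("widget", Money(5, "USD")))
    db.commit()                              # passes; an empty invoice would fail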
Attributes
----------
In contrast to peak.model, peak.schema attributes will be defined using
binding-like descriptors, rather than nested classes. The nested class
idea was originally to support mixins via TransWarp's module inheritance
AOP facility, and is no longer needed. Early this year I developed a
prototype of some of these descriptors as part of the "Spike"
proof-of-concept for simplifying the Chandler architecture. Since then, a
production version of the descriptors was put into use, and Chandler 0.6
will use them almost exclusively for schema definition.
This let me field-test and verify some of the ideas I had for schema
description via attributes, and it seems to have worked out well. I'll
likely do something similar to what I implemented in Spike, which is closer
to my original concept for peak.schema. I do intend, however, to expand
the total number of descriptors to five: One, Many, Sequence, Mapping,
Graph. (I had originally intended to do just the first four, and in Spike
I tried to see how far I could get with just the first two. Chandler uses
my original four.)
Anyway, a "one" attribute is a normal attribute, "many" is a set (what you
would usually want for a many-valued attribute), "sequence" is a list,
"mapping" a dictionary-like object, and "graph" is a dictionary of
sets. In principle it could be a dictionary of lists, but that's getting
too darn complicated for me.
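For concreteness, here is roughly how the five kinds correspond to plain
Python structures (this is only an analogy; the real descriptors will expose
proxy objects rather than raw built-ins):

    # Rough correspondence between the five descriptor kinds and the plain
    # Python structures they behave like:
    one      = "Smith"                                   # One:      a single value
    many     = {"ann", "bob"}                            # Many:     an unordered set
    sequence = ["draft", "review", "final"]              # Sequence: an ordered list
    mapping  = {"en": "hello", "fr": "bonjour"}          # Mapping:  a dictionary
    graph    = {"ann": {"bob", "carl"}, "bob": {"ann"}}  # Graph:    a dict of sets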
One reason, by the way, for distinguishing "many" and "sequence" is that
relational databases suck at maintaining a user-defined sequence. That is,
if I chuck "1 9 3 27 16" into a database, I'm not guaranteed to get them
back in that order, although I can easily get them to come back "1 3 9 16
27" if I want. If you need a sequence, you need an extra column to hold
either the sequence numbers or a series of linked-list pointers. Each
approach has advantages and disadvantages, so the choice of the translation
will need to be part of the database mapping. (And I'm tempted to make it
so that you *have* to specify it manually, to discourage unnecessary use of
sequences where "many" will do.)
The descriptors themselves are not really responsible for mapping to a
database, of course. They are only responsible for supplying proxy
structures that convert to and from a fact-oriented (aka domain relational
calculus) view of the data. In essence, this typically amounts to flinging
tuples around. For example, a "One" attribute is nominally represented in
fact orientation using a two-tuple: the object and the attribute value, and
you can either add or remove that tuple from a fact set. So a "One"
descriptor will remove the old tuple and add the new one when you change a
value. The underlying database object and various helpers will be
responsible for managing translation from this "elementary conceptual fact"
format to an appropriate "compound physical fact" format -- i.e., table row
modifications, or stores to sets and dictionaries in-memory. :) The
database and its helpers will also be responsible for caching reads and/or
aggregating writes.
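Here's a naive in-memory sketch of that tuple-flinging for a "One" attribute;
the One class and the bare set standing in for the fact layer are purely
illustrative:

    # Naive, in-memory sketch of a "One" attribute kept as (object, value)
    # fact tuples; the real descriptors will hand this work to the database layer.
    class One:
        def __init__(self, facts):
            self.facts = facts                       # the shared set of fact tuples
            self.current = {}                        # object id -> current value
        def __set__(self, obj, value):
            old = self.current.get(id(obj))
            if old is not None:
                self.facts.discard((id(obj), old))   # retract the old fact
            self.facts.add((id(obj), value))         # assert the new one
            self.current[id(obj)] = value
        def __get__(self, obj, owner=None):
            if obj is None:
                return self
            return self.current.get(id(obj))

    person_name_facts = set()

    class Person:
        name = One(person_name_facts)

    p = Person()
    p.name = "Joe"
    p.name = "Joseph"
    print(person_name_facts)    # only the (id(p), 'Joseph') tuple remains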
A "Many" attribute is similarly easy to convert, but requires a
"set"-interface wrapper that converts reads and writes on the set into
elementary fact tuple operations. Pretty much a piece of cake.
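A sketch of such a wrapper, again with an ordinary set of (subject, value)
tuples standing in for the real fact layer, might look like this (names
invented):

    # Sketch of a set-like wrapper that turns set operations into elementary
    # fact tuple adds and removes (illustrative only).
    class FactSetProxy:
        def __init__(self, facts, subject):
            self.facts = facts          # all (subject, value) fact tuples
            self.subject = subject
        def add(self, value):
            self.facts.add((self.subject, value))
        def discard(self, value):
            self.facts.discard((self.subject, value))
        def __iter__(self):
            return (v for s, v in self.facts if s == self.subject)
        def __contains__(self, value):
            return (self.subject, value) in self.facts
        def __len__(self):
            return sum(1 for _ in self)

    dept_employees = set()              # every (department, employee) fact
    team = FactSetProxy(dept_employees, "R&D")
    team.add("ann")
    team.add("bob")
    print(sorted(team), ("R&D", "ann") in dept_employees)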
Sequence, Mapping, and Graph are all more interesting because they involve
at least "ternary facts" - i.e. 3-tuples or higher. Depending on the
mapping, a sequence might be implemented using sequence numbers or a linked
list. Both mappings suck, for different reasons. If you actually use
sequential sequence numbers, you get cheap indexed sequential access, but
you incur O(N) renumbering cost every time you do something other than add
to the end. Linked lists are pretty cheap to re-order but sequential
access is O(M log N), where M is the number of items retrieved (versus
O(log N)+M with sequence numbers). If properly indexed, you can jump to the
start or
end of a linked list pretty cheaply and iterate from there, but paging is hard.
Not that paging in general is easy. Of course, a key factor to keep in
mind about these kinds of sequence is that if there's any reorganization,
it's normally done by a human. So there are some natural limits to how big
the list can get and how far up or down a person is going to drag
something. Nonetheless, it's best to consider these issues before choosing
to model something as a sequence instead of a set.
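To illustrate the trade-off described above, here is a toy sketch of the two
encodings; none of this reflects a planned mapping API:

    # Toy sketch of the two encodings:
    # 1) explicit sequence numbers -- cheap ordered reads, but inserting
    #    anywhere except the end renumbers everything after it (the O(N) cost)
    rows = [("task1", 1), ("task2", 2), ("task3", 3)]        # (item, seq_no)

    def insert_at(rows, position, item):
        rows = rows[:position] + [(item, 0)] + rows[position:]
        return [(name, i + 1) for i, (name, _) in enumerate(rows)]  # renumber

    rows = insert_at(rows, 1, "task1b")

    # 2) a linked list -- cheap to reorder, but ordered reads chase pointers
    next_item = {"task1": "task2", "task2": "task3", "task3": None}

    def in_order(next_item, head):
        while head is not None:
            yield head
            head = next_item[head]

    print(rows)
    print(list(in_order(next_item, "task1")))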
Mappings and Graphs are different from Sequence in that they directly
expose the third part of the ternary fact - the "key". I still haven't
gone through all the possibilities as far as combining these with other
descriptors at the "opposite end". It seems to me that in principle you
should be able to do things like have an object with a "name" attribute
that actually corresponds to the key in a Mapping attribute on some other
object. Certainly you can model that concept in a fact-oriented schema,
but I haven't yet figured out how you would "spell" that in the object
version of the schema.
All of the attribute types except "One" will do most of their work via some
type of proxy container object, simulating a set, list, dictionary, or
dictionary of sets.
In principle, it should be possible to define other kinds of attributes
besides these few essentials. An attribute descriptor's main
responsibilities are (a rough skeleton follows the list):
* Create the fact-level schema for itself, and transform model-defined
constraints to fact-level constraints, adding them to the fact-level schema
* Transform reads and writes to what's needed for the fact layer (possibly
via a helper proxy)
* Check attribute-level constraints at assignment time (when used in a
Value type; the fact layer is responsible for this when using Entity types)
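A rough, self-contained skeleton that mirrors those three responsibilities
might look like the following; every name in it is made up for illustration:

    # Rough skeleton mirroring the three responsibilities above.
    class CustomAttribute:
        def __init__(self, name, value_type, *constraints):
            self.name = name
            self.value_type = value_type
            self.constraints = constraints

        def contribute_fact_schema(self, fact_schema):
            # 1. declare the elementary fact type and carry the translated
            #    model-level constraints along with it
            fact_schema[self.name] = (self.value_type, self.constraints)

        def make_proxy(self, facts, subject):
            # 2. route reads and writes through the fact layer (here, a plain
            #    set of (subject, value) tuples stands in for it)
            return {v for s, v in facts if s == subject}

        def check(self, value):
            # 3. attribute-level checks at assignment time (for value types)
            if not isinstance(value, self.value_type):
                raise TypeError("expected %s, got %r"
                                % (self.value_type.__name__, value))
            for constraint in self.constraints:
                constraint(value)

    fact_schema = {}
    age = CustomAttribute("age", int)
    age.contribute_fact_schema(fact_schema)
    age.check(42)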
As you might guess, this probably means that peak.schema Entity objects
will be somewhat slower to use than Value objects, since reads and writes
will be passed through to the fact layer, at least in early
versions. We'll experiment with caches later, once the semantics are
sufficiently settled.
Anyway, the general idea should be that if you can come up with a
descriptor for some specialized kind of feature mapping (e.g. transitive
relationships, symmetric relationships, or other database exotica), you
should be able to do it by having your descriptor translate to a fact
oriented schema, and by using an appropriate helper proxy.
The helper proxies will be created by a generic function (of course) that
has access to the descriptor, the fact schema, and the target
database. Proxies can and do get cached in a given Entity instance, so the
creation (and rule overhead) should only occur on first access. But the
use of a generic function means you'll have the opportunity to tweak the
proxy type selection (or instance configuration) for a specialized
situation if you need to. It also means that the development path of the
core descriptors should be relatively smooth for me; I can do most of the
core development with One, Many, and Mapping, without needing any
restructuring to implement the rest.
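As a sketch of that dispatch-on-descriptor idea, here's a version using the
standard library's functools.singledispatch as a stand-in for PEAK's (more
capable) generic functions; Many, Mapping, SetProxy, DictProxy, and proxy_for
are all invented names:

    # Sketch of proxy selection dispatched on descriptor type, using
    # functools.singledispatch as a stand-in for PEAK's generic functions.
    from functools import singledispatch

    class Many: pass
    class Mapping: pass

    class SetProxy:
        def __init__(self, descriptor): self.descriptor = descriptor

    class DictProxy:
        def __init__(self, descriptor): self.descriptor = descriptor

    @singledispatch
    def proxy_for(descriptor):
        raise NotImplementedError(type(descriptor))

    @proxy_for.register(Many)
    def _(descriptor):
        return SetProxy(descriptor)

    @proxy_for.register(Mapping)
    def _(descriptor):
        return DictProxy(descriptor)

    # An application could register a further method here to override the
    # proxy chosen for some specialized situation, without touching the core.
    print(type(proxy_for(Many())).__name__)    # SetProxy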
Sets and Constraints
--------------------
Sets are at the heart of peak.schema. Types are sets. Constraints are
sets (of valid objects). Even attributes are sets! (The set of values of
the attribute over all objects of that type.)
Constraints are applied to sets. You can say that one set must be a subset
of another, for example. Some of the uses of this include saying that an
attribute must hold a certain type (i.e., the set of values of this
attribute must be a subset of the values of the type).
There are in fact only three fundamental kinds of constraints in
peak.schema (and in the modelling theories that it's based on):
* subset relationships (subset, equivalence, and mutual exclusion)
* cardinality constraints (e.g. uniqueness)
* "ring" constraints (acyclic, intransitivity, irreflexiveness, etc.)
At the object level of the schema, however, these are more diverse in
nature. Most of what is commonly thought of as constraints or type
information in other schema systems (like zope.schema and FormEncode's
validators) amounts to subset constraints. Or more precisely, those kinds of
constraints define sets (possibly infinite ones) that you then want
attributes to be a subset of. Even required-ness or nullability of an
attribute can be expressed in terms of subset or set equivalence
relationships, by saying that the set of objects having the attribute is a
subset of (or an equivalent set to) the set of objects of that type. (Thus
making it optional or mandatory.)
But at the fact-oriented layer of the schema, all these constraints will
just be reflected as "X is a subset of Y" or "X is the same set as Y",
without really caring about what set Y represents or how it's
implemented. It'll simply attempt to validate that constraint when adding
or removing values in set X, and fancier fact-to-relational mappings might
also apply the constraints to the DDL with CHECK clauses. (Or, if
the target set is something stored in the database, an appropriate foreign
key constraint should get created.)
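A minimal sketch of that kind of enforcement -- rejecting additions to X that
aren't already in Y -- could look like this (ConstrainedSet and
ConstraintError are invented for the example):

    # Minimal sketch of enforcing "X is a subset of Y" when values are added
    # to X, the kind of check the fact layer would perform.
    class ConstraintError(Exception):
        pass

    class ConstrainedSet:
        def __init__(self, superset):
            self.values = set()
            self.superset = superset          # the set Y we must stay inside
        def add(self, value):
            if value not in self.superset:
                raise ConstraintError("%r is not in the allowed set" % (value,))
            self.values.add(value)

    valid_states = {"draft", "review", "final"}
    doc_states = ConstrainedSet(valid_states)
    doc_states.add("draft")                   # fine
    # doc_states.add("bogus")                 # would raise ConstraintError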
Cardinality constraints are also familiar DB concepts - keys and
indexes. Technically, index information isn't about cardinality per se,
but I think it'll be simplest to express index information and optimizing
hints via the same sort of structure as declaring cardinality or uniqueness
constraints.
Finally, ring constraints are kind of a specialty thing. I'd like to make
sure the infrastructure can support things like those, but I don't really
plan to do a full implementation of all the ORM-style ring constraints,
certainly not in the first version. :) It's more that I want to make sure
that the infrastructure can deal effectively with exotic constraints.
The object layer will provide lots of syntactic sugar for constraints, so
that common subset constraints can be specified compactly. For example,
you should be able to just specify a type or a validator of some kind and
have it just "know" you mean to make an attribute a subset of
that. Specifying more exotic constraints like mutual exclusion or set
equivalence should require some sort of wrapper, like 'exclude(...)' or
'exactly(...)' or something. Anyway, in the simple case:
foo = schema.One(Bar)
should just know you mean to make the 'foo' attribute accept 'Bar'
instances. More elaborately:
foo = schema.One(Bar, someValidator, otherConstraint)
would apply the other constraints to the attribute as well.
Queries
-------
A couple of years ago, I tried to create a peak.query library to implement
conceptual queries against arbitrary database back-ends. Unfortunately, I
only knew just enough about the field to be dangerous. :) I grokked the
domain relational calculus, tuple relational calculus, relational algebra,
some ORM query stuff, and I even figured out that Python listcomps are
isomorphic to the domain relational calculus, making them an ideal syntax
for expressing queries in Python. But I was a bit stumped as to how to
make it all work, between objects and a relational DB, or even multiple
DBs. It was just too complex.
I did, however, order the ORM text, "Information Modeling and Relational
Databases", a huge tome on the basis of conceptual modeling and elementary
fact orientation. And apparently I've beaten myself over the head with it
long enough that some of it has sunk in, because when I mentally picked up
where I left off on peak.query, I found that my brain had filled in a
surprising number of the gaps in what I did previously.
Partly, this is because modeling based on elementary facts is much simpler
than either the relational or object model. Elementary facts are simple
ideas like "The person with name 'Joe' was hired on '9/1/1991'", usually
representable as 2-tuples or 3-tuples. So, to evaluate a query like:
[(p.birthdate, p.hiredate) for p in Person if p.name=='Joe']
You simply convert it to the domain relational calculus as follows:
v1,v2: person_birthdate(v3,v1) and person_hired(v3,v2) and v3=='Joe'
which then simplifies to:
v1,v2: person_birthdate('Joe',v1) and person_hired('Joe',v2)
And then transform to the tuple relational calculus, assuming the
person_birthdate and person_hired facts map to an 'employees' table:
v1, v2: employees('Joe', v1, v2, ...wildcards for other columns)
And this then trivially maps to SQL:
select e1.birthdate, e1.hire_date from employees e1 where name='Joe'
But the real advantage comes in with stuff like:
[(p.name, p.spouse.name) for p in Person if p.spouse.hiredate < p.hiredate]
which becomes a DRC query of:
v1,v2: person_spouse(v1,v2) and person_hiredate(v1,v4) and
person_hiredate(v2,v5) and lt(v5,v4)
and then a TRC query of:
v1, v2: employees(v1,...,hiredate=v4,spouse=v2) and
employees(v2,...,hiredate=v5) and lt(v5,v4)
resulting in SQL of:
SELECT e1.name, e1.spouse
FROM employees e1, employees e2
WHERE e2.name=e1.spouse AND e2.hiredate<e1.hiredate
In other words, arbitrary joins, even cartesian joins, made fresh to order.
Now, none of these translation stages are *trivial*, mind you. They
absolutely depend on knowing things like what type 'p.spouse' will refer
to, and so on. But the basic process should also be extensible to multiple
databases or even no database, in the sense that pure in-memory data
structures should be just as queriable.
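As a concrete illustration of the "no database" case, the spouse query above
can be run directly against plain in-memory objects, which is essentially
what the listcomp syntax already means:

    # The same kind of query, evaluated directly against in-memory objects.
    class Person:
        def __init__(self, name, hiredate, spouse=None):
            self.name, self.hiredate, self.spouse = name, hiredate, spouse

    ann = Person("Ann", 1995)
    joe = Person("Joe", 1991, spouse=ann)
    ann.spouse = joe
    people = [joe, ann]

    result = [(p.name, p.spouse.name)
              for p in people
              if p.spouse.hiredate < p.hiredate]
    print(result)    # [('Ann', 'Joe')]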
One of the things that made peak.query seem farfetched at its inception was
that I hadn't implemented generic functions yet, so I wasn't entirely
certain that I could implement a Python expression parser of this
complexity. Now, I've already got an implementation that works nicely for
everything but the listcomp structure itself. It should be a breeze to
update it, and I also now have experience doing the sort of abstract
expression interpretation and constant folding that the generic function
stuff does, and that this system will need.
Also, as it happens, the actual implementation of a system like this really
wants to use generic functions for at least the SQL generation part. Lots
of silly little details vary between SQL implementations in ways that
really want to be rule-based.
The middle layer, DRC->TRC conversion and tuple merging, is pretty simple,
as the elementary fact types are basically views onto tables. So, the
conversion basically consists of replacing the elementary fact types with
the tables they're views of, and filling in wildcards for the empty
slots. You then go through and unify any predicates that are against the
same table with matching variables in unique columns (or sets of unique
columns), thereby reducing to simple TRC.
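Here's a toy sketch of that replace-and-unify step, with a hard-coded
fact-to-table mapping; the data structures are invented purely to show the
shape of the transformation:

    # Toy sketch of the DRC->TRC step: replace elementary fact predicates with
    # the table columns they're views of, then merge predicates that hit the
    # same table with the same key variable.
    FACT_TO_TABLE = {
        "person_birthdate": ("employees", "name", "birthdate"),
        "person_hired":     ("employees", "name", "hire_date"),
    }

    def to_trc(drc_predicates):
        merged = {}                           # (table, key_var) -> {column: var}
        for fact, key_var, value_var in drc_predicates:
            table, key_col, value_col = FACT_TO_TABLE[fact]
            cols = merged.setdefault((table, key_var), {key_col: key_var})
            cols[value_col] = value_var       # unify on the shared key variable
        return merged

    drc = [("person_birthdate", "'Joe'", "v1"), ("person_hired", "'Joe'", "v2")]
    print(to_trc(drc))
    # {('employees', "'Joe'"): {'name': "'Joe'", 'birthdate': 'v1', 'hire_date': 'v2'}}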
For actual query generation, you can support multiple back-ends by having
each back end pull out the parts of the TRC that relate to it, replacing
them with pseudo-tables representing the portion of the query they can
handle. When all backends have done this, the remaining TRC predicates
have to be joined by the engine itself, to do any cross-database joins. If
there's just one predicate, the engine just iterates over it and yields the
results.
Of course, there are many devils in the details! In this simple overview,
I've avoided things like 'OR', outer joins, aggregate functions and the
like. Some of these things will be harder to implement than others, mainly
because I need to work out how to express them cleanly in my concept of how
the DRC/TRC intermediate representations will work, especially if they're
to map cleanly to SQL. The grammar isn't an issue, because things like
this are natural in Python:
[(d.name, max(p.salary for p in d.employees)) for d in Department]
Thus, the only sticking point is how to get from there to the SQL that may
need to use either subqueries or GROUP BY, and handling cases where the
backend doesn't support one of the needed features.
But, once these issues are resolved, peak.query will end up being the
easiest to use, most natural syntax possible for expressing database
queries, against zero or more physical back ends. Coupled with the other
features I've described, peak.schema should be able to be the premier
object-relational library -- not just for Python, but for any programming
language, hands down.
Any comments or questions so far? I haven't covered issues of caching or
transactions to any significant extent yet, mainly because that's (mostly)
beneath the fact layer, except insofar as transaction boundaries are likely
the appropriate time for certain kinds of constraint checking.