[PEAK] The O-R mapping layers
Phillip J. Eby
pje at telecommunity.com
Mon Nov 10 13:39:10 EST 2003
Physical Layer
--------------
The lowest layer of the O-R mapping system will consist of a "database"
component, with subcomponents representing tables, or properly speaking,
relation variables, since they may actually be views or queries.
A "database" in this context does not mean a database connection. A
database component may contain (encapsulate) connections to more than one
physical database, be they LDAP or SQL or something else altogether. The
actual connection objects are irrelevant outside of the db component,
because all usage of the database is through reference to its "tables".
These tables will probably be something like 'storage.SQLTable' and
'storage.LDAPTable', derived from the existing peak.query.AbstractRV class,
once peak.query has gone through a bit more evolution. In particular,
they'll need to evolve methods such as:
* __iter__ (so that you can do e.g. 'for row in db.someTable:')
* makeQuery() (to create a callable that then returns an iterable; this is
so that queries can be pre-compiled and then cached as methods of the DB or
of other objects. Dynamic SQL generation involves quite a few calls, so we
want to be able to reuse them without regenerating them all the time.)
* insert(), delete(), and update() methods of some sort, to allow data
manipulation.
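To make the shape of that interface concrete, here's a toy sketch of a relvar backed by an in-memory list of row dicts. The class and method bodies are purely illustrative (this is not the actual peak.query API), but it shows the iteration, pre-compiled query, and DML methods described above:

```python
# Illustrative only -- not the real peak.query table API.

class MemoryTable:
    """A toy relvar backed by an in-memory list of row dicts."""

    def __init__(self, rows):
        self._rows = [dict(r) for r in rows]

    def __iter__(self):
        # supports 'for row in db.someTable:'
        return iter(self._rows)

    def makeQuery(self, where):
        # Pre-compile a filter into a reusable callable that returns
        # an iterable each time it's invoked, so the (notionally
        # expensive) query construction happens only once.
        def query():
            return [r for r in self._rows if where(r)]
        return query

    def insert(self, row):
        self._rows.append(dict(row))

    def delete(self, where):
        self._rows = [r for r in self._rows if not where(r)]

    def update(self, where, **changes):
        for r in self._rows:
            if where(r):
                r.update(changes)

emps = MemoryTable([{'id': 1, 'name': 'bob'}, {'id': 2, 'name': 'alice'}])
by_name = emps.makeQuery(lambda r: r['name'] == 'alice')  # cached query
emps.update(lambda r: r['id'] == 1, name='robert')
```

The cached 'by_name' callable re-evaluates against the table's current rows each time it's called, which is the point of makeQuery(): build once, run many times.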
When combined with the existing '(where=..., join=..., etc.)' capability of
peak.query relvars, some per-backend function mapping capabilities, and
good ol' peak.binding, it should be possible to easily create a DB
component for any given application's database needs. Indeed, it should be
relatively straightforward within this layer to mask minor backend-specific
schema differences (such as the different date/time arithmetic functions
supplied by Sybase vs. Oracle, for example, or columns that have to be
renamed due to different reserved words on different backends).
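As a rough illustration of that function-mapping idea (the names and table here are invented, not part of PEAK), a per-backend map can hide dialect differences behind one logical function name:

```python
# Invented example: mask backend-specific date arithmetic behind
# a single logical function name.

FUNCTION_MAP = {
    'oracle': {'add_days': lambda col, n: "%s + %d" % (col, n)},
    'sybase': {'add_days': lambda col, n: "dateadd(day, %d, %s)" % (n, col)},
}

def render(backend, func, *args):
    # Look up the logical function for this backend and render SQL text.
    return FUNCTION_MAP[backend][func](*args)
```

So the same logical expression renders as 'hire_date + 30' on Oracle but 'dateadd(day, 30, hire_date)' on Sybase, without the application-level schema knowing the difference.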
All in all, this will give us a physical layer whose capabilities and
simplicity of setup should either match or exceed those of most other
existing "SQL mapping" libraries for Python. However, ours will also work
with LDAP, will allow constructing arbitrary joins and aggregate queries,
and allow abstraction of functions and data type conversion across multiple
backends --even at the same time. And, to top it all off, it'll be
workable with any data source that can be rendered as rows and columns of
atomic data values according to some stable schema.
This alone will be a useful thing to have. But then we'll add...
The Mapping Layer
-----------------
The mapping layer is responsible for defining the relationship between a
peak.model type or feature, and a pair of projections over relvars provided
by a database component. Okay, now let me explain that in English...
A mapping component will have a binding to a database component, and more
importantly, it will have bindings to various tables (relvars) provided by
the database component. Then, for each feature (attribute or method) of a
type that maps to the database, the mapping component must provide three
things:
1. The relvar (table, join, or other query construct) where the feature is
found
2. A projection (ordered collection of column names) over the relvar,
expressing where to find the type's primary key within the relvar.
3. A projection over the relvar expressing where the feature may be found.
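Those three items could be grouped as a simple structure, something like the hypothetical sketch below (the names are mine, not PEAK's). The payoff of keeping the primary-key projection explicit is visible in same_source(): two features can share one join exactly when their relvar and key projection match.

```python
# Hypothetical structure for the three mapping items above.
from collections import namedtuple

FeatureMap = namedtuple('FeatureMap', 'relvar pk_projection projection')

name_map = FeatureMap('employees', ('emp_id',), ('emp_name',))
dept_map = FeatureMap('employees', ('emp_id',), ('dept_id',))

def same_source(a, b):
    # Two features can reuse one relvar+key when both parts match.
    return (a.relvar, a.pk_projection) == (b.relvar, b.pk_projection)
```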
Since relvars can contain arbitrary joins, aggregates, expressions, etc.,
this is an extremely general mapping mechanism. You might think, by the
way, that we could leave out the primary key projection and simply track
each table's primary key, but it really can't work that way. We also can't
eliminate the projections by simply performing a projection on the relvar,
because we don't want to join a table repeatedly (once for each feature
that appears there). Thus, we need to be able to know when two features
are based on the same relvar and primary-key projection, in order to reuse
that relvar+key.
For each type, we'll also need this relvar+projections data, intended to
represent how to find canonical instances of the type. That is, if we are
trying to find "employees", where do we query for a list of their primary
keys? This information is needed to form the root of conceptual queries, but
is also needed if one wishes to obtain information from a particular
subtype of some general type in a database.
It's possible that each mapping component will be intended for a specific
object type, rather than being just a schema-wide mapping. That would
probably offer more opportunity for component reuse. So, a given mapping
class would just list the features of the peak.model type (probably as
bindings), defining the relvar+projection lists. Probably there would also
be bindings to reference the relvars, and a class attribute or two to
represent primary key projections, to make it easier for the other bindings
to do their thing. That is, most features would probably look something like:
aFeature = binding.Make(lambda self:
    PathSegment(self.someTable, self.pk, ('a_column',))
)
(A relvar+projection makes up a "path segment" in a conceptual query.) In
addition to the raw mapping data, it's likely that a mapping component
would also indicate what features (if any) should be lazily loaded when the
object is loaded from the database. This is important for e.g. avoiding
loading CLOBs or BLOBs unnecessarily, and would also allow the easy
construction of a default query for the object.
Indeed, it might be possible for the base class of these mapping components
to provide bindings for query methods (see 'makeQuery' in the physical
layer notes) that would provide "get" and "find" operations on the object's
"default state" query.
At this point it begins to sound as though the mapping components are in
fact data managers, and I can see that caching policy perhaps should live
with the mapping components. But I'm not wholly convinced that DM's and
mapping components are the same thing, nor of what shape DM's will
metamorphose into by the time this design (and its implementation) are done.
Conceptual Queries
------------------
The conceptual query layer isn't really a component layer like the
others. Instead, it will be services provided by peak.model classes, like
our old 'Employee.where()' example. The final query syntax will actually
resemble our original proposed syntax, with keyword arguments being a
shortcut to express feature information, if they are being "and"ed
together. IOW,
Employee.where(foo=bar, baz=spam, __as__='emp')
is a shortcut for the following lower-level query expression:
DEFVAR('emp', True,
    TYPE(Employee,
        AND(
            FEATURE('foo', EQ(bar)),
            FEATURE('baz', EQ(spam))
        )
    )
)
(or something similar). Anyway, it should be possible to convert the
resulting object into an executable query, perhaps by passing it to the
container that holds all the mapping components. Actually, it should also
be possible to convert it into just a relvar, where it might then be
possible to do insert/update/delete operations, depending on the nature of
the underlying query.
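The kwargs-to-query-expression shortcut can be sketched in a few lines. This is a minimal toy, assuming the DEFVAR/TYPE/AND/FEATURE/EQ constructors are simple node builders (here just tuples); the real objects will surely be richer:

```python
# Toy node constructors standing in for the real query-expression
# objects; each just builds a tagged tuple.
def EQ(value):                return ('EQ', value)
def FEATURE(name, cond):      return ('FEATURE', name, cond)
def AND(*conds):              return ('AND',) + conds
def TYPE(klass, cond):        return ('TYPE', klass, cond)
def DEFVAR(name, flag, body): return ('DEFVAR', name, flag, body)

class Employee:
    @classmethod
    def where(cls, __as__=None, **kw):
        # Each keyword becomes a FEATURE(..., EQ(...)) term; multiple
        # keywords are "and"ed together, as described above.
        conds = [FEATURE(k, EQ(v)) for k, v in sorted(kw.items())]
        cond = conds[0] if len(conds) == 1 else AND(*conds)
        return DEFVAR(__as__, True, TYPE(cls, cond))

q = Employee.where(foo='bar', baz='spam', __as__='emp')
```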
And that's it. Those are our layers. The higher we go in the layers, the
less precisely we know how they'll work, because they depend on things that
are still uncertain in the lower layers. But, as we move forward these
things should be able to be sorted out pretty clearly.
I caution you, however, that if you currently depend on the existing "data
manager" framework, you should be aware that it may change significantly by
the time this O-R mapping system is complete. That's because we're likely
to move away from the notion of "DM as mapper to arbitrary data source" and
more towards "DM as a user of relvar(s)". Given the new tools for relvar
manipulation, I believe we'll wind up in a place where such manipulation is
something you do outside the DM, which will simply receive a ready-to-use
data structure. And, if you are mapping to other kinds of external systems
(e.g. Ulrich's IMAP work), you'll simply work on exposing those systems as
relvars. That is, table-like constructs of atomic values, that can
potentially be joined with other data sources.