[PEAK] ORM with the Trellis

Phillip J. Eby pje at telecommunity.com
Mon Feb 18 19:07:21 EST 2008


So, this post is a brain dump of my current plan(s) for developing 
object-relational mapping support for the Trellis.

Basically, the idea is that you will write simple rules to transform 
objects into flat records, akin to what one might use in a relational 
database.  Other rules, in turn, will map back from the flat records 
into objects.
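
For instance (a minimal sketch of my own -- the Person class, the
record shape, and the rule signatures are illustrative assumptions,
not the planned API), such a pair of rules might look like:

     from collections import namedtuple

     # a flat record, akin to a row in a "person" table
     PersonRow = namedtuple('PersonRow', 'id name parent_id')

     class Person(object):
         def __init__(self, id, name, parent=None):
             self.id, self.name, self.parent = id, name, parent

     def person_to_row(p):
         # flatten an object into a record; an object reference
         # becomes a foreign-key-like value
         return PersonRow(p.id, p.name,
                          p.parent.id if p.parent else None)

     def row_to_person(row, lookup):
         # inverse rule; 'lookup' resolves an id back to an object
         return Person(row.id, row.name,
                       lookup(row.parent_id) if row.parent_id else None)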

Of course, anything other than flat values or simple 1:1 references 
will be slightly more complex, requiring the use of virtual data 
structures that translate operations on the collections into set 
operations on record sets.

(Being trellis-based, of course, these virtual structures should only 
need to be implemented once, and made to work with any underlying 
trellisized collection of records.)
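
To illustrate (a sketch resting on my own assumptions about the
record API -- namedtuple-style records held in a set-like,
trellisized collection), a virtual "children" collection might
translate its operations like so:

     class ChildrenOf(object):
         # a virtual collection: nothing is stored here; it's just
         # a live view over a set of records with a parent_id field
         def __init__(self, records, parent_id):
             self.records, self.parent_id = records, parent_id
         def __iter__(self):
             return (r for r in self.records
                     if r.parent_id == self.parent_id)
         def add(self, record):
             # a collection-level add becomes a set-level insert
             self.records.add(record._replace(parent_id=self.parent_id))
         def remove(self, record):
             self.records.discard(record)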

However, this "object" layer is actually the least 
"interesting" part of what needs to be done, because it's really 
nothing more than an API facade, wrapping a record-oriented API with 
an object-oriented one.  And the trellis should make that pretty 
straightforward.

So, the more interesting parts, in my view, have to do with 
developing the record- and recordset-oriented APIs, and the 
lower-level mappings of these APIs onto storage back-ends.

My plan is for the record-level APIs to be based on the tuple 
relational calculus (TRC), with extensions for grouping and the 
like.  The TRC is also the basis of SQL, and it should be relatively 
straightforward to turn TRC specifications into SQL queries.

Equally important, the TRC is also a good intermediate form for 
translating object-based list comprehensions.  List comprehensions 
can be considered a superset of the domain relational calculus (DRC), 
a close neighbor of the TRC.

The idea is to be able to take a string like:

     [(p, c) for p in Person for c in p.children]

And turn it into a TRC representation like:

     c = ResultVar('c')
     p = ResultVar('p')
     PersonRecord(c, parent=p)

And from there to SQL like:

    select person.parent_id as p, person.id as c
      from person

And then turn the results from that query back into an iterable of 
pairs of Person objects.
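
That last, record-to-object step might be as simple as this sketch
(the identity map is an assumption of mine; object identity handling
isn't specified above):

     class Person(object):
         def __init__(self, id):
             self.id = id

     _identity = {}   # one Person object per database id

     def hydrate(rows):
         # turn the (p, c) id pairs from the query into object pairs
         def get(id):
             return _identity.setdefault(id, Person(id))
         return ((get(p_id), get(c_id)) for p_id, c_id in rows)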

Ideally, it should be possible to use such comprehension expressions 
to define parameterized queries and "views".
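
For example (my illustration, not a committed API), abstracting the
comprehension from earlier over a parameter would be all it takes to
express a parameterized query, which the TRC layer could compile once
and re-run with different bindings:

     def children_of(name):
         # 'name' would become a bound parameter in the generated SQL
         return [(p, c) for p in Person
                        for c in p.children if p.name == name]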

Indeed, the TRC layer should make it possible to have writable views, 
expressed as TRC expressions.  This would let you keep your 
object-record mappings the same, regardless of your ultimate back-end 
storage schema, since a TRC expression like PersonRecord may in 
practice expand out to a join of multiple tables (for example).
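
To make the writable part concrete (a hypothetical sketch -- the
table split and the 'db' insert API are invented for illustration),
a write to one logical PersonRecord might expand into inserts on two
physical tables:

     def add_person_record(db, rec):
         # one logical record, two physical rows behind the view
         db.insert('person', id=rec.id, name=rec.name)
         db.insert('person_parent', id=rec.id, parent_id=rec.parent_id)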

However, the TRC layer in and of itself will not really "do" anything 
much.  It will primarily be a means of representing (and 
algebraically manipulating) records, queries, and the associated 
schemas.  It will be the responsibility of a "back end" layer to 
actually map TRC expressions into recordsets, and to support the 
addition, deletion, or modification of records.

So, it looks like there will be at least three segments to the ORM 
roadmap: an object-record layer, the TRC layer, and the "back 
end".  The TRC layer will probably depend heavily on PEAK-Rules, 
which is approaching its first alpha release.  The back end will 
depend on Contextual (in order to offer storage services).

The TRC layer will also use Contextual, in that I want records and 
queries to offer direct operations that then map to back-end 
operations.  In other words, it should be possible to do adds and 
updates using "+RecordType(value, value, ...)" and deletes using 
"-RecordType(somekey=someValue)".  It should also be possible 
to iterate over TRC queries.  All of these operations will be 
implemented in the TRC layer by delegation to context-specific 
back-end services, so that you can do things like:

     with someDB:
         +Foo(...)
         -Bar(...)
         # etc.

to make changes in a specific database.
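
Mechanically, a rough sketch of how that could work (simplified: a
bare stack stands in for Contextual's context-specific service
lookup, and back-ends would supply the actual add/delete methods):

     _stack = []   # stand-in for Contextual's service lookup

     class Record(object):
         def __pos__(self):              # +SomeRecord(...) -> add
             _stack[-1].add(self)
         def __neg__(self):              # -SomeRecord(...) -> delete
             _stack[-1].delete(self)

     class Database(object):
         # makes 'with someDB:' select the back-end to delegate to
         def __enter__(self):
             _stack.append(self)
             return self
         def __exit__(self, *exc):
             _stack.pop()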

There will also need to be a default, in-memory back-end service for 
these operations.  The memory back-end will be the first back-end implemented.
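
A first cut might be little more than a set of records plus
predicate-based queries -- continuing the hypothetical Database
sketch above:

     class MemoryBackend(Database):
         def __init__(self):
             self.records = set()
         def add(self, record):
             self.records.add(record)
         def delete(self, record):
             self.records.discard(record)
         def query(self, predicate):
             # the TRC layer would compile query expressions down
             # to predicates like this one
             return [r for r in self.records if predicate(r)]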

Once there's a TRC layer and a memory back-end, it will then make 
sense to begin work on the object-record mapping layer, although 
there will probably be some points at which I'll need to do some 
preliminary design on that layer in order to drive decisions being 
made at other levels.  Next, mapping object-based comprehension 
queries to TRC queries will get us a complete ORM...  albeit a 
memory-based one!

At that point, work on transactions, sessions, and the like will need 
to take place, as well as mapping to an SQL-based backend.  SQLite 
will be the initial mapping target.

There is some possibility of using existing SQL libraries (such as 
SQLAlchemy) as backends for the TRC layer.  I haven't investigated 
this very much at this point, but it would likely be 
helpful.  Unfortunately, it appears at first glance that SQLAlchemy 
is based on the relational algebra, rather than either of the 
relational calculi, which makes it less suitable (IMO) as a direct 
target for automated translations, especially of one schema to another.

It's one of those cases where a product is "infuriatingly good" -- 
i.e., so potentially useful that the remaining barrier(s) to using it 
are quite frustrating.  It's frankly quite tempting to just glue the 
Trellis directly to SQLAlchemy as closely as possible, and be done 
with it.  There'd be no in-memory support (except by using SQLite in 
memory mode, I suppose), but it'd certainly be a much less complex project.

However, the major disadvantage to doing that is that Chandler's 
sharing framework depends on EIM -- a record-oriented type 
system.  So we will want to be able to define EIM records and use 
them as a basis for queries and other types of operations.  And the 
TRC layer in the design above is basically a fancier version of EIM, 
that allows querying and add/remove/update operations.

So, the catch is, we want to define a mapping from objects to records 
once and only once, so that we get both storage and sharing for the 
price of one schema definition.  And I don't think that SQLAlchemy's 
schema information is sufficient for what we need in EIM.

That means that a TRC layer pretty much has to exist, even if it 
isn't much more than a wrapper over SQLAlchemy's schema definition 
system.  And, because it needs to be able to transform records 
directly, it almost certainly isn't going to be able to rely on 
SQLAlchemy for schema translation.

However, it may be that I can simply not bother with implementing an 
in-memory backend, at least in the short run.  It might make as much 
sense, or more, to use SQLAlchemy from the start, with an SQLite 
memory db for testing.  EIM, after all, is based on simple records 
and record sets, not complex queries.  On the other hand, Chandler's 
existing EIM library does have to do some limited in-memory 
indexing...  so that aspect of things still has to be implemented.
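
For what it's worth, the SQLite-memory-db route is cheap to set up
with SQLAlchemy (the table layout here is just an example):

     from sqlalchemy import (create_engine, MetaData, Table,
                             Column, Integer, String)

     metadata = MetaData()
     person = Table('person', metadata,
                    Column('id', Integer, primary_key=True),
                    Column('name', String(40)),
                    Column('parent_id', Integer))

     engine = create_engine('sqlite:///:memory:')  # throwaway test db
     metadata.create_all(engine)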

Some preliminary conclusions:

* An EIM-like TRC layer appears to be a must; SQLAlchemy doesn't have 
one, and it's needed for Chandler sharing.  In addition, databases 
aren't the only source of records that Trellis applications will want 
to work with.  (For example, I've been thinking about writing a GUI 
to control my VOIP conferencing system in real-time, displaying 
events as people go on/off line and speak, etc.)

* SQLAlchemy looks very promising as a back-end, and even potentially 
as a front-end mapper.  It supports inheritance, polymorphism, 
lazy/eager loading, multi-object rows, cascades, and just about every 
other SQL feature I can think of.

So, the next steps should probably be working on the design of the 
TRC layer, and investigating (and/or stealing ideas from) SQLAlchemy.



