[TransWarp] Towards a query theory, part 5: Object-Relational Mapping
Phillip J. Eby
pje at telecommunity.com
Wed Oct 15 21:25:13 EDT 2003
Let's take a moment and circle back to the place this all began: John
Landahl's TableDM and discussion of queries.
We want to use the mechanisms described thus far to allow us to map an
object model onto a relational database, and conversely, to apply
relational queries to our object models. Can we do this?
Let's start with filters. Filters are applied to some starting context: an
RV representing some data type, such as Employee or Customer. We've been
talking about them as though the data type was an actual table, but of
course it can be any RV... so the Employee "table" could actually involve
various joins, column renames, and type conversions. As relational
operators are applied to this RV by a filter, the built-in optimization
will push column removals and criteria downward, and type conversions
upward. The end result is a precisely-optimized query *based on the
underlying relational schema*!
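To make this concrete, here is a tiny, runnable sketch of the push-down idea. The Table and Remap classes (and the tuple "plan" that select() returns) are inventions for illustration only, not PEAK APIs:

class Table:
    def __init__(self, name, columns):
        self.name, self.columns = name, list(columns)
    def __repr__(self):
        return "Table(%r, %r)" % (self.name, self.columns)

class Remap:
    """A logical RV that renames physical columns (logical name -> physical name)."""
    def __init__(self, base, **renames):
        self.base, self.renames = base, renames
    def select(self, logical_col, value):
        # push the criterion below the rename by translating the column name,
        # so it can eventually be handed to the underlying database
        physical = self.renames.get(logical_col, logical_col)
        return ('select', self.base, physical, value)

EMP = Table('EMP', ['EMPNO', 'ENAME'])
Employee = Remap(EMP, employeeId='EMPNO', name='ENAME')

print(Employee.select('name', 'Smith'))
# ('select', Table('EMP', ['EMPNO', 'ENAME']), 'ENAME', 'Smith')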
Now you can see why I've been pushing so hard for certain properties in
this theory, such as a purely functional model in which all objects are
essentially immutable. This lets us create *one* RV expression for each
class in our model, one that determines the mapping from the "ideal"
representation to the actual physical representation. Because it's
immutable, we can "share" (or at least reuse) the same instance for every
query we want to do over that object type. In principle, we can also
describe one physical schema in terms of another, which might be useful for
dealing with name changes in a schema that are required when moving between
databases with different reserved names.
So let's talk briefly about the "ideal" representation of an object in
relational form. It should be a representation that can automatically be
determined from information in the peak.model metadata for the object
type. Presumably, this representation would keep the same column names as
are used for the corresponding attribute names, except where the attribute
is a collection, in which case the collection would be represented as
another "table" with an appropriate join mechanism to the main
table. Bidirectional one-to-many relationships would use a standard
foreign key mechanism, while many-to-many relationships would have a
(logical) join table.
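Purely as an illustration of deriving that ideal form from metadata, here is a sketch; the dictionary below stands in for whatever peak.model actually provides, and the synthetic '_id'/'Employee_id' key columns are assumptions of the sketch:

model = {                            # stand-in for peak.model metadata
    'Employee': {
        'name':         'scalar',
        'hireDate':     'scalar',
        'phoneNumbers': 'collection',    # one-to-many
    }
}

def ideal_schema(model):
    schema = {}
    for cls, features in model.items():
        # scalar features keep their attribute names as column names
        schema[cls] = ['_id'] + [n for n, k in features.items() if k == 'scalar']
        for name, kind in features.items():
            if kind == 'collection':
                # collections become another logical "table", joined back
                # to the owner by a foreign key column
                schema['%s_%s' % (cls, name)] = ['%s_id' % cls, name]
    return schema

print(ideal_schema(model))
# {'Employee': ['_id', 'name', 'hireDate'],
#  'Employee_phoneNumbers': ['Employee_id', 'phoneNumbers']}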
Given this "ideal" form, it should then be possible to use the relational
operators to transform the actual database schema into the "ideal"
form. We would then need a mechanism to allow us to find the correct RV
expression for a given type (or perhaps collection) within our
schema. Perhaps we could do this via adaptation to a protocol representing
membership in the target schema.
Thus, to retrieve all the "flat" data for a particular object type, a DM
would only need to retrieve the appropriate RV with a 'select()' added to
pick the right object(s), and possibly a remap() to drop any attributes that
should be lazily loaded. (Hmmm... lazy-loading is an implementation-side hint, so
the DM shouldn't know this if it's physical schema-independent. There's
probably another adaptation issue present here.)
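To give that a concrete shape (nothing here is an actual PEAK API; the class and method names are made up for the sketch), the schema lookup plus a DM's flat-data retrieval might read:

class RV:
    """Toy RV: just records the operations applied to it."""
    def __init__(self, desc):
        self.desc = desc
    def select(self, criteria):
        return RV('select(%s, %s)' % (self.desc, criteria))
    def remap(self, *columns):
        return RV('remap(%s, %s)' % (self.desc, ', '.join(columns)))

class Schema:
    """Maps model classes to the (shared, immutable) RV describing them."""
    def __init__(self):
        self._rv_for_type = {}
    def register(self, cls, rv):
        self._rv_for_type[cls] = rv
    def rv_for(self, cls):
        return self._rv_for_type[cls]

class Employee: pass

schema = Schema()
schema.register(Employee, RV('EmployeeRV'))

# a DM loading one object's "flat" data: select the right row, then
# remap away anything that should be lazily loaded
def load_flat(schema, cls, oid, eager_columns):
    rv = schema.rv_for(cls)
    return rv.select("employeeId == %r" % oid).remap(*eager_columns)

print(load_flat(schema, Employee, 42, ['employeeId', 'name']).desc)
# remap(select(EmployeeRV, employeeId == 42), employeeId, name)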
Anyway, if we also make it possible to apply INSERT, UPDATE, and DELETE
operations to an RV, then mapping from objects back to the database is also
doable. And voila! Physical storage independence at the logical
abstraction layer for all possible backends, including ones that span
different physical databases.
So what's missing? Well, I just did some handwaving regarding type
conversions, not to mention INSERT/UPDATE/DELETE. Let's tackle type
conversions (aka functions) first.
In my last segment, I mentioned the need to deal with functions. But I was
considering them a sidetrack, not something too important for the core
model. As it happens, I was wrong. In order to wrap an O/R mapping into a
collection of relational operators, it's a critical capability. Heck, it's
a critical capability just for writing backend-independent queries that do
date/time manipulations! (Most databases have a mutually incompatible
mishmash of date/time types.)
Let's say I want to do something as simple as finding items billed between
two dates. I'd definitely prefer to express the query in terms of Python
date/time types, specifically whatever ones my logical model uses. Of
course, that means I'll need to have conversion functions to map to/from
the types used by my physical database. And, if we use the "functions
float as high as possible" idea from my last posting, then we will only
convert retrieved data to Python types once it's retrieved from the
database. (That is, our top-level RV will convert the values in each row
returned from the underlying SQL or other query.)
But, what about criteria? If we only float conversions up in the operator
tree, then the conversion will become "stuck" at the point where our
date-range comparison occurs, and our query will end up retrieving all data from
the underlying DB, regardless of billing date, then use Python to convert
the date field, and then check whether the date is in range. This isn't
acceptable for our intended purposes.
What can we do? Most type conversions *must* take place at the Python
level, because if the database could convert to the desired type, we could
just use the right type in the first place! Since the conversion must take
place in Python, our only option in searching for a date range is to
convert the date range to the type expected by the database. In other
words, we must perform *inverse type conversion* on all constants and input
parameters.
The algorithm would probably need to be part of the mechanism used to float
functions upward. Whenever we encounter a comparison that would otherwise
keep the function from floating further upward, we can check if the
function has an *inverse function*. If so, we can apply the inverse
function to the opposite side of the comparator, as long as it is a
constant or input parameter. If it's a constant, we can apply it
immediately and compute the new constant. If it's a parameter, we need to
somehow annotate it with the function conversion.
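Here's a minimal, runnable illustration of that inverse-conversion step; the 'inverse' attribute and the tuple representation of a criterion are inventions for the example, not the proposed mechanism itself:

from datetime import date

def db_to_pydate(s):                     # conversion applied to retrieved rows
    return date(*map(int, s.split('-')))

def pydate_to_db(d):                     # its inverse, usable on constants
    return d.isoformat()

db_to_pydate.inverse = pydate_to_db

def push_through_conversion(column, op, constant, conversion):
    """Rewrite  conversion(column) <op> constant  as  column <op> inverse(constant),
    so the criterion can be handed to the underlying database."""
    inverse = getattr(conversion, 'inverse', None)
    if inverse is None:
        return None      # no inverse: the conversion stays "stuck" above the criterion
    return (column, op, inverse(constant))

print(push_through_conversion('BILL_DT', '>=', date(2003, 1, 1), db_to_pydate))
# ('BILL_DT', '>=', '2003-01-01')

(Pushing an inequality through a conversion like this is only valid when the conversion is order-preserving, as it is for ISO-formatted date strings.)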
The mechanism needed for this isn't clear to me yet, any more than input
parameters in general are. Most of yesterday and today's
architecture is almost at "code-level clarity" (i.e., I could sit down and
try to write it), but input parameters in general are still an odd
beast. I'll come back to them in a minute, after verifying that
conversions are handled.
Okay, so we can float functions and conversions upward, flipping them to
their inverse where they touch criteria, so that the maximum number of
functions appear at the boundary between Python and an underlying DB
schema. (Note that even if a DB implements a function directly, we still
want it to do the computation at the last possible moment, to minimize the
number of invocations.)
As an alternate way of looking at functions and type conversion, we could
perhaps view an invertible function as a kind of virtual table (the way
Gadfly 2 does). To implement a function, we could simply join from the
applicable table to the function's table. Since joins are flattened to
whatever extent is possible, and everything else tries to go below joins,
then the functions will naturally "float to the top".
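As a toy illustration of that viewpoint (again, nothing here is PEAK code), the conversion function becomes a virtual two-column relation of (raw, converted) pairs, and "joining" a physical row to it just computes the converted column:

from datetime import date

def pydate_rows(raw_values):
    """The virtual conversion 'table': one (raw, converted) row per input value."""
    for raw in raw_values:
        yield raw, date(*map(int, raw.split('-')))

rows = [{'EMPNO': 1, 'BILL_DT': '2003-10-15'},
        {'EMPNO': 2, 'BILL_DT': '2003-10-16'}]

# the (outer) join of the physical rows to the virtual table on BILL_DT = raw
converted = dict(pydate_rows(r['BILL_DT'] for r in rows))
joined = [dict(r, BILL_DT_as_date=converted[r['BILL_DT']]) for r in rows]

print(joined[0]['BILL_DT_as_date'])      # 2003-10-15, as a datetime.date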
If a function is represented by joining to an RV operator, then our normal
attempt to push criteria to their most-specific RV will find us comparing
one of the function's "columns" to our date range. So, when the function's
.select() is applied with constant logical expressions on the converted
date column, the function can optimize this by performing the inverse
conversion on the constants, and applying them to the unconverted column.
This is a little bit messy, since we really want the criteria to move back
to one of the columns joined with the conversion table. There may be a
clean way to do this in the logic that moves criteria down from a join, or
more likely, in the logic that moves remaps (projections) downward.
When we move a remap down through a join, we could drop the conversion
tables and copy their criteria back to the column(s) joined on. However,
there would probably need to be a specialized join type for this, since
under normal circumstances you can't toss out a joined table merely
because you're not retrieving any of its non-join columns. That *is* safe with
outer joins, though, so we can model functions and type conversions as an
outer join to a virtual table.
So now we can have a standard RV type representing a function or type
conversion. If the function is supported by a database to which the query
is applied, we can turn the outer join and virtual table back into a function
application. I'm not 100% clear on how that will work; we'll look at it
again when we look at SQL generation algorithms. Outer join syntax (and
even the ability to do them in the first place) often varies quite a bit
between databases, though, so it's quite possible that the table->function
restoration operation will be *easier* than the code generation for *real*
outer joins!
By the way, in case you're wondering why I opted to explore the virtual
table and outer join route, it's for two reasons:
1. It eliminates the need to somehow "mark a parameter as needing type
conversion", and
2. It allows us to express queries on *expressions* and *object methods* --
even ones with parameters -- as part of our fundamental query language!
You see, now that we know how to express a function as a table, and apply
it to a column by way of an outer join, we are not limited to simple
two-column conversion functions, oh no. We can define any object method
M(self,p1,p2,...pN) as a table with N+2 columns, e.g.
(Result,self,p1,p2,...,pN). If there is a native implementation of this
method for a given back-end, we can provide it as a usable base RV. If
not, then a Python implementation of the function can be provided.
In practice, you will not often use this method aspect, and when you do,
the RV method table will need to also include columns for any other input
data used by the method. (That is, the 'self' column has to be replaced
with a collection of columns representing other attributes used to
implement the method.)
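For instance (a made-up method, not one from the model above), a hypothetical ageAt(self, asOf) method could be modelled as a relation whose columns are (Result, birthDate, asOf), with 'self' already replaced by the birthDate attribute the method reads:

from datetime import date

def age_at_row(birthDate, asOf):
    """One row of the virtual 'method table': the Result column plus its inputs."""
    result = asOf.year - birthDate.year - (
        (asOf.month, asOf.day) < (birthDate.month, birthDate.day))
    return {'Result': result, 'birthDate': birthDate, 'asOf': asOf}

row = age_at_row(date(1970, 6, 1), date(2003, 10, 15))
print(row['Result'])     # 33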
But, you *can* use this for expressions. Addition, subtraction,
multiplication, division, etc. Once again, these become outer joins to
virtual tables, which SQL generation can convert back to infix operators. And get
this... if you search for 'Some.field+1 == 2', it'll get converted
internally to 'Some.field == 1', because of the constant-folding mechanism
described previously. Pretty cool, eh?
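A tiny runnable sketch of that folding step (the tuple form of the criterion is, again, just for illustration):

def fold_add_comparison(column, addend, constant):
    """Rewrite  (column + addend) == constant  as  column == (constant - addend),
    i.e. apply the addition table's inverse to the constant side."""
    return (column, '==', constant - addend)

print(fold_add_comparison('Some.field', 1, 2))
# ('Some.field', '==', 1)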
So let's see if input parameters are cleaner now. Presumably, each RV will
simply include its parameters as part of its column list. The parameters
would come from joined tables and from parameters used by selection
criteria. Since parameter-based criteria float down to the table they
apply to, input parameters will first appear at their usage point, and be
copied upwards to the top of the query.
At any point, a parameter may be "required" (i.e. the RV can't execute
without a value for it), or "optional" (it can be specified, but the RV can
run without it.) When a join pairs an optional and a required parameter,
the parameter is optional with respect to the joined RV, so parameters stop
being required as soon as they can be joined with another column that's
labelled the same. When this happens, the criteria that were applied to
the parameter in the table where it was "required" should be rephrased as a
join condition at the point where the parameter became optional.
Suppose we treated parameters as virtual tables, too? Then, when we
compare to a parameter (as an inequality), this is the same as creating a
join between the current RV and the parameter table, and the criterion
stays at the join. If we compare equal to a parameter, that's the same as
a remap equating the column to the parameter, which is the "optional" form
of parameter. As joins are flattened into larger joins, the "required"
parameter table will move up in the join, until/unless it is joined with
its "optional" form, at which point the inequality can be changed to
reflect this.
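In this view a parameter "table" is just a one-column relation whose single row is supplied at execution time; a throwaway sketch (names invented):

class ParameterTable:
    """A one-column virtual relation standing in for an input parameter."""
    def __init__(self, name):
        self.name = name                 # e.g. 'aCountry'
        self.columns = [name]
    def rows(self, **params):
        # the single row is supplied when the query is actually executed
        return [{self.name: params[self.name]}]

print(ParameterTable('aCountry').rows(aCountry='Norway'))
# [{'aCountry': 'Norway'}]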
Let's see how this would work in our old "same city different country"
query from part 1... beginning with the conceptual form:
Employee1
  + - lives in City1
  + - was born in Country1
  + - supervises Employee2
        + - lives in City1
        + - was born in Country2
              + - is not Country1
Restated as a "friendly" Python form:
aCity = City.where()
aCountry = Country.where()
coResidentForeignSupervisors = Employee.where(
    livesIn = aCity,
    bornIn = aCountry,
    supervises = Employee.where(
        livesIn = aCity,
        bornIn = IS_NOT(aCountry)
    )
)
And restated as a primitive filter:
crfs = AND(
    FEATURE('livesIn', EQ(VAR('aCity'))),
    FEATURE('bornIn', EQ(VAR('aCountry'))),
    FEATURE('supervises',
        AND(
            FEATURE('livesIn', EQ(VAR('aCity'))),
            FEATURE('bornIn', IS_NOT(VAR('aCountry'))),
        )
    )
)
Let's convert that to relational operators now, by calling
crfs.buildQuery() with an "Employee" RV...
First, the AND() passes the employee RV down to each feature's
buildQuery(). livesIn and bornIn both map to the same RV, and just tag the
appropriate columns as being the named parameters. The 'supervises'
feature takes another reference to the Employee RV (it's a self-join), and
remaps it so that the 'supervisor' column is also labelled with an
autogenerated correlation parameter name. It then passes this relabelled
Employee RV down to the next AND() object.
The inner AND() object passes the RV down to the two features, which of course
apply to the same table. The EQ() labels 'livesIn' with 'aCity', but the
IS_NOT() can't do this for 'aCountry', so it joins to a parameter table for
'aCountry', using an IS_NOT join criterion, and then returns the joined
(and relabelled) table.
This table is returned by AND() to the FEATURE('supervises'), which now
joins it with the original Employee RV that's been annotated with the
'aCity'/'aCountry' parameter names. So, our structure looks like:
supervisor = remap(EmployeeRV, aCity='livesIn', aCountry='bornIn', ...)

supervised = theta_join(
    IS_NOT('bornIn','aCountry'),
    remap(EmployeeRV, aCity='livesIn', privateCorrelation='supervisor', ...),
    parameterTable('aCountry')
)

the_query = natural_join(
    remap(supervisor, privateCorrelation='employeeId',...),
    supervised
)
The '...' in the remaps are to indicate that all other columns are kept as
is, e.g. foo=foo, bar=bar, for all columns previously present. The final
step before execution is to apply a remap that deletes all non-output columns.
Okay, so when we join the supervisor and supervised intermediate RV's, we
are going to combine the natural join and the theta join, ending up with a
theta join that looks like:
theta_join(
    AND(
        IS_NOT('t2.bornIn','t3.aCountry'),
        EQ('t1.livesIn', 't2.livesIn'),
        EQ('t1.employeeId','t2.supervisor'),
        EQ('t1.bornIn', 't3.aCountry')
    ),
    remap(supervisor, privateCorrelation='employeeId',...),
    remap(EmployeeRV, aCity='livesIn', privateCorrelation='supervisor', ...),
    parameterTable('aCountry'),
)
At this point, we can drop the parameter table from the join, resulting in:
theta_join(
    AND(
        IS_NOT('t2.bornIn','t1.bornIn'),
        EQ('t1.livesIn', 't2.livesIn'),
        EQ('t1.employeeId','t2.supervisor')
    ),
    remap(supervisor, privateCorrelation='employeeId',...),
    remap(EmployeeRV, aCity='livesIn', privateCorrelation='supervisor', ...),
)
Which is just what we wanted. So it's doable, although far from obvious
how to fit this cleanly into the framework so it happens
automatically. Oddly, this still seems cleaner to me than special "input
parameter" columns, because this approach naturally causes the parameter to
float up with its criteria still attached, putting it in a good position to
be combined with another instance of the parameter. The "input parameter"
approach seems like it forces you to reach down into an underlying RV and
yank out its criteria.
One other result of going through this exercise, by the way, is the
realization that our filter-to-relational conversion algorithm isn't quite
finished. The algorithm in part 3 had a slightly different notion of what
constituted an RV than the very precise model we have now. Traversals
need to operate a little differently, and there are some steps that were
left out. For example, each traversal that generates a new "table alias"
has to put a remap on it so its columns are named with an 'alias.column'
pattern.
The new twists mean that the so-called "Final Algorithm" needs to be
revised. Part of the problem is that our filters concept mixes relational
operators in a somewhat unnatural fashion. We need to separate the join
structure from the criteria structure, so that buildQuery() returns three
things:
1. an RV
2. a logical expression that should be applied to the RV() as a .select()
3. a list of output parameters that should be used to .remap() the RV,
leaving only the outputs in the result
Only the "original" caller of buildQuery() should actually apply the remap
or select, because the criteria or parameter list may otherwise be incomplete.
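A minimal mock-up of that three-part protocol (the class names and the string-based RV are inventions for the example), showing that only the top-level caller touches select() and remap():

class RV:
    def __init__(self, desc):
        self.desc = desc
    def select(self, logex):
        return RV('select(%s, %s)' % (self.desc, logex))
    def remap(self, *columns):
        return RV('remap(%s, %s)' % (self.desc, list(columns)))

class EQ:
    def __init__(self, column, value):
        self.column, self.value = column, value
    def buildQuery(self, contextRV, contextDV):
        # comparisons contribute a criterion; they leave the RV alone
        return contextRV, 'EQ(%s, %r)' % (self.column, self.value), {}

def run_filter(filter_expr, root_rv, traverser=None):
    rv, logex, outparms = filter_expr.buildQuery(root_rv, traverser)
    # applied exactly once, by the "original" caller:
    return rv.select(logex).remap(*outparms)

print(run_filter(EQ('livesIn', 'Oslo'), RV('EmployeeRV')).desc)
# remap(select(EmployeeRV, EQ(livesIn, 'Oslo')), [])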
So, now traversal operators can modify the RV as it goes down the tree, and
should provide context to their children that tells them what alias name(s)
to use for columns. Logical operators combine their children's logical
expressions according to the appropriate operator, and pass the other two
items back up unchanged. Actually, it's more like:
def buildQuery(self, contextRV, contextDV):    # for logical operators (AND, OR...)
    outparms = {}
    terms = []
    for child in self.children:
        # each child may transform the RV; pass the latest version to the next
        contextRV, logex, parms = child.buildQuery(contextRV, contextDV)
        terms.append(logex)
        outparms.update(parms)
    return contextRV, self.__class__(*tuple(terms)), outparms
That is, we keep each child's version of the RV as the version we pass to
the next child, and we keep all the parameters from each child. Notice
that contextDV remains unchanged throughout, because a logical operator has
no effect on the current traversal position.
Also, what I previously called the contextDV is probably actually going to
be some sort of "schema traverser" object that understands how to find the
right column for a feature, and/or create or join to a new table
alias. This traverser would know what its current table alias name is so
it can generate column names. Traversal operators would act on this
traverser, to get a new traverser to pass to child expressions, and modify
the RV they pass down accordingly. Traversals only occur downward, so
traversal ops throw away the new traverser when they return, and they
return their child's return values unchanged.
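One possible shape for that traverser (illustrative only; the alias-numbering scheme is just an assumption):

import itertools

class Traverser:
    """Knows the current table alias and how to spell column names under it."""
    _aliases = itertools.count(2)            # 't1' is reserved for the root

    def __init__(self, alias='t1'):
        self.alias = alias

    def column(self, feature):
        return '%s.%s' % (self.alias, feature)        # e.g. 't1.livesIn'

    def traverse(self, feature):
        # following a feature introduces a fresh alias for the target table;
        # the traversal operator would also adjust the RV it passes down
        return Traverser('t%d' % next(self._aliases))

root = Traverser()
print(root.column('livesIn'))                    # t1.livesIn
child = root.traverse('supervises')
print(child.column('bornIn'))                    # t2.bornIn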
Comparison operators create a version of themselves with an explicit
operand, using the current traverser (contextDV). They return this as
their 'logex' return value, leaving the contextRV unchanged in the
return. (Unless they are an equality comparison and the RHS is a
parameter, in which case they remap the contextRV to include the equality.)
VAR() operators remap their contextRV to indicate that the current column
is equal to the variable, then pass it down to their child. On the return,
if the VAR() is flagged for 'show', then they also add the variable name to
the output parameter list.
And so it goes, until the entire logical structure is created. So, our
previous example should return:
rv = theta_join(
    AND(
        EQ('t1.employeeId','t2.supervisor'),
        EQ('t1.livesIn','t2.livesIn'),    # courtesy of matching labels
    ),
    remap(EmployeeRV, aCity='livesIn', aCountry='bornIn', ...),
    remap(EmployeeRV, aCity='livesIn', ...)
)
logex = AND(
    EQ('t1.livesIn', VAR('aCity')),
    EQ('t1.bornIn', VAR('aCountry')),
    EQ('t2.livesIn', VAR('aCity')),
    IS_NOT('t2.bornIn',VAR('aCountry')),
)
outparms = {}
We now perform 'rv.select(logex)' which will apply the criteria as close to
the underlying tables as possible. Specifically, it will push the criteria
into the theta join's AND(), and the theta_join will push down any criteria
that point to a single table, after attempting to unify any parameters with
the same table as the opposite side of the criterion, or failing that, any
table that provides the parameter. If the parameter is not present, then
it's an error: the parameter was never defined in a way that was usable.
So, by waiting to apply selection criteria until the entire join structure
is complete, we don't have to do anything funky to unify correlation or
input parameters. No crazy joins, no special handling to detect required
vs. optional parameters, nothing. Very cool.
We still need our traversal system to detect when it should autogenerate an
EXISTS() join, but that's practically an implementation detail. Functions
and data type conversions are part of the join structure, and so are
effectively part of traversal. Any comparison applied to a function output
gets the right explicit operand for the generated name of the output
column. Then, when the criteria are applied to the final join structure,
the pushdown and possible constant-folding occurs.
So what have I left out? INSERT/UPDATE/DELETE, of course, but also some
details of the constant-folding mechanism for function
tables. Specifically, the algorithm I've given so far for doing it only
works if the converted value isn't used in the output. So really, there's
going to need to be special handling so that when the value *is* used in
the output (e.g. we request the billing date, *and* that it be between two
dates), the constant-folding still takes place and the criteria move to the
original table column.
But I think that's a subject to tackle in a future missive, where I'll deal
with traversal operators, join construction, and join folding in more detail.