[TransWarp] Notes on part 6

Fri Oct 17 00:56:19 EDT 2003

At 10:29 PM 10/16/03 -0400, Phillip J. Eby wrote:
>Whew.  That wasn't so bad.  Doing a quick review of a few ConQuer->SQL 
>examples show that there's still a devil of a lot of details to get right, 
>such as defaulting an RV in the theta_join if the start-DV is never given 
>an RV, and tracking output variables from MAYBE outer joins.  However, the 
>relational join structures created by these rules seem to closely mirror 
>the SQL generated by ConQuer, at least for the examples I reviewed.  But, 
>I haven't reviewed any queries involving aggregates, mainly because I 
>haven't established a filter syntax for them, let alone a conversion algorithm.

A few missing details...

* NOT() expressed over a FEATURE() with no nested conditions will mean a 
NOT IN (SELECT ...) if the FEATURE(s) create joins, but needs to be treated 
as a 'column IS NULL' if no joins occur under the NOT.  Ugly, but true.  I 
should check whether any other special cases are created by the absence of 
joins "near" subqueries or other joins.

* A nice property of the current plan is that it completely wipes out the 
issue of outer-join precedence.  As it happens, the "outer joins fanning 
out from a collection of inner joins" structure we're using is guaranteed 
free from precedence conflicts.  It also happens that in cases where outer 
join columns end up having criteria attached, we can "innerfy" the 
join.  This will probably be a common situation where we have an 
object-relational view that includes outer joins, and we're expressing 
criteria on one of the outer-joined tables.  So, as long as logical 
expressions know whether they're "non-null" (i.e., know that they can never 
be true when compared to a null), it's possible to remove the "outerness" 
of the join, as it's guaranteed that the "null" rows wouldn't get returned 
anyway.

* We've ignored compound keys and ternary (or higher-arity) relationships 
in the discussion so far.  Our main use cases may not include these as a 
requirement, though.  Compound keys can't cleanly be used for IN (SELECT 
...) subqueries.
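
To make the first two notes concrete, here is a minimal sketch using 
Python's sqlite3 module as a stand-in database.  The customer/invoice 
tables, the nickname column, and the rows() helper are invented purely for 
illustration; none of them come from the design above.  It checks that a 
LEFT JOIN with a null-rejecting criterion on the outer-joined table returns 
the same rows as the inner join, and shows the two SQL shapes that a bare 
NOT() might take.

```python
import sqlite3

# Hypothetical schema, invented for this illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, nickname TEXT);
    CREATE TABLE invoice (id INTEGER PRIMARY KEY,
                          customer_id INTEGER NOT NULL,
                          total INTEGER NOT NULL);
    INSERT INTO customer VALUES (1, 'Ann', 'Annie'), (2, 'Bob', NULL),
                                (3, 'Cy', NULL);
    INSERT INTO invoice VALUES (10, 1, 50), (11, 1, 200), (12, 2, 75);
""")


def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# "total > 60" can never be true for a NULL total, so the null-extended
# row for Cy is filtered out either way and the join can be "innerfied".
outer_filtered = rows("""
    SELECT c.name, i.total FROM customer c
    LEFT JOIN invoice i ON i.customer_id = c.id
    WHERE i.total > 60 ORDER BY c.name, i.total""")
inner_filtered = rows("""
    SELECT c.name, i.total FROM customer c
    JOIN invoice i ON i.customer_id = c.id
    WHERE i.total > 60 ORDER BY c.name, i.total""")
assert outer_filtered == inner_filtered

# NOT() over a feature that needs a join: NOT IN (SELECT ...).
no_invoices = rows("""
    SELECT name FROM customer
    WHERE id NOT IN (SELECT customer_id FROM invoice) ORDER BY name""")

# NOT() over a feature stored on the same table: a plain IS NULL test.
no_nickname = rows(
    "SELECT name FROM customer WHERE nickname IS NULL ORDER BY name")
```

The NOT IN form here relies on invoice.customer_id being declared NOT NULL; 
a NULL in the subquery result would make NOT IN match nothing.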

Man, it's getting easier to see why databases are so expensive.  We haven't 
even *touched* the issues of physical-level optimization, concurrency, 
updates, etc.

It also seems clear to me that, as I work up the libraries for this, I'm 
going to need to have ways to "leave out" some features for now, so we can 
get something usable, sooner.  One of the easiest ways to do that would be 
to focus first on implementing the relational algebra, and skipping the 
concept filter -> relational translation subsystem for the time being.  But 
note that relational algebra is like SQL, only uglier.  :)  Even with the 
advantage provided by view wrappers, it's going to be very messy to use.

For my project at work, I still need to come up with a slightly more 
compact notation for the conceptual query syntax, as we are planning to go 
through an existing application and convert its queries to a conceptual 
form.  Sadly, it might then be less effort to convert those queries to 
relational algebra form by hand than to write the automatic 
conceptual->relational translator.  Oh well, one thing at a time.  I think 
I've got relational algebra down pat now, at least for the subset needed to 
implement conceptual queries, and adding aggregates to a relational algebra 
framework is just another RV operator.
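
As a rough illustration of treating an aggregate as just another operator, 
here is a minimal in-memory sketch.  The class names (Table, Select, 
Aggregate), the rows() protocol, and the sample data are all hypothetical 
and are not the framework's actual API.  The point is only that Aggregate 
consumes and produces rows the same way every other operator does, so it 
composes freely with the rest.

```python
from collections import defaultdict


class Table:
    """A base relation: a fixed list of rows, each row a dict."""

    def __init__(self, rows):
        self._rows = [dict(r) for r in rows]

    def rows(self):
        return iter(self._rows)


class Select:
    """Keep only the rows of `source` for which `predicate` is true."""

    def __init__(self, source, predicate):
        self.source = source
        self.predicate = predicate

    def rows(self):
        return (r for r in self.source.rows() if self.predicate(r))


class Aggregate:
    """Group `source` by columns and compute named aggregates per group.

    Each keyword argument maps an output column to (function, column),
    where the function receives the list of that column's group values.
    """

    def __init__(self, source, group_by, **aggregates):
        self.source = source
        self.group_by = list(group_by)
        self.aggregates = aggregates

    def rows(self):
        groups = defaultdict(list)
        for r in self.source.rows():
            groups[tuple(r[c] for c in self.group_by)].append(r)
        for key, members in groups.items():
            out = dict(zip(self.group_by, key))
            for name, (func, column) in self.aggregates.items():
                out[name] = func([m[column] for m in members])
            yield out


invoices = Table([
    {"customer": "Ann", "amount": 50},
    {"customer": "Ann", "amount": 200},
    {"customer": "Bob", "amount": 75},
])
per_customer = Aggregate(invoices, ["customer"],
                         total=(sum, "amount"), count=(len, "amount"))
# Selecting over an aggregate behaves like SQL's HAVING clause.
big_spenders = Select(per_customer, lambda r: r["total"] > 100)
```

Because Select only needs something with a rows() method, wrapping it 
around an Aggregate requires no special-casing.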

So, it's definitely looking like the relational algebra framework is going 
to be the thing to code first, although a conceptual query notation is also 
still important to decide on.  (The latter doesn't have to be written as 
valid Python expressions, though; we just need a working syntax.)

We can then work on algebra -> SQL/LDAP translation on the one hand, and 
conceptual filter -> algebra translation on the other, to complete our 
conceptual -> SQL/LDAP stack.  This is definitely a major project in all, 
and not nearly as "incremental" in nature as most work that's been done 
with/on PEAK to date.  Further, the overall risk is what I'd consider 
rather scary, which means that it's probably up to a level that most 
project managers would never even notice, but then I'm really paranoid.  ;-)

But I think the risks should be mitigated quite a bit by starting with the 
algebra (which is very straightforward to implement, and based on decades 
of solid mathematics), and then moving to SQL query generation (which 
absolutely has to work if the end result is going to be useful).  So we 
should be able to get feasibility feedback before too long.  It'll just be 
a longer time than I'm accustomed to for getting that kind of feedback.

Damn, it's almost 1 AM.  I'd better get some sleep if I'm to have any 
chance of getting to work tomorrow.  As it is, I've been at home sick since 
Tuesday, which means I've spent nearly all my waking hours working on this, 
instead of being distracted by other things at the office, like all the 
other work I should be doing.  :)