[TransWarp] Notes on part 6
Phillip J. Eby
pje at telecommunity.com
Fri Oct 17 00:56:19 EDT 2003
At 10:29 PM 10/16/03 -0400, Phillip J. Eby wrote:
>Whew. That wasn't so bad. Doing a quick review of a few ConQuer->SQL
>examples show that there's still a devil of a lot of details to get right,
>such as defaulting an RV in the theta_join if the start-DV is never given
>an RV, and tracking output variables from MAYBE outer joins. However, the
>relational join structures created by these rules seem to closely mirror
>the SQL generated by ConQuer, at least for the examples I reviewed. But,
>I haven't reviewed any queries involving aggregates, mainly because I
>haven't established a filter syntax for them, let alone a conversion algorithm.
A few missing details...
* NOT() expressed over a FEATURE() with no nested conditions will mean a
NOT IN (SELECT ...) if the FEATURE(s) create joins, but needs to be treated
as a 'column IS NULL' if no joins occur under the NOT. Ugly, but true. I
should check whether there are any other special cases that are created by
the absence of joins occurring "near" subqueries or joins.
* A nice property of the current plan is that it completely wipes out the
issue of outer-join precedence. As it happens, the "outer joins fanning
out from a collection of inner joins" structure we're using is guaranteed
free from precedence conflicts. It also happens that in cases where outer
join columns end up having criteria attached, we can "innerfy" the
join. This will probably be a common situation when we have an
object-relational view that includes outer joins, and we're expressing
criteria on one of the outer-joined tables. So, as long as logical
expressions know whether they're "non-null" (i.e., know whether they return
false when compared to a null), it's possible to remove the "outerness"
of the join, as it's guaranteed that the "null" rows wouldn't get returned
anyway.
* We've ignored compound keys and ternary (or higher-arity) relationships
in discussion so far. Our main use cases may not include these as a
requirement, though. Compound keys can't cleanly be used for IN (SELECT
...) subqueries.
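
To make the first two points above a bit more concrete, here is a rough
Python sketch. Every name in it (render_not_feature, Criterion, Join,
innerfy, the rejects_null flag, and the example tables) is made up for
illustration and is not part of PEAK or any existing API. It also assumes
that all criteria are ANDed together in the WHERE clause, and that each
criterion can report whether it rejects nulls.

```python
from dataclasses import dataclass


def render_not_feature(key_column, feature_column, subquery=None):
    """Render NOT() over a FEATURE() that has no nested conditions.

    `subquery` is the SELECT text produced when the feature creates joins;
    it is None when no joins occur under the NOT. (Hypothetical helper.)
    """
    if subquery is not None:
        return f"{key_column} NOT IN ({subquery})"
    return f"{feature_column} IS NULL"


@dataclass
class Criterion:
    table: str          # table the criterion applies to
    sql: str            # rendered condition text
    rejects_null: bool  # True if the condition can never hold for a NULL


@dataclass
class Join:
    table: str
    on: str
    outer: bool = True


def innerfy(joins, criteria):
    """Return copies of `joins`, demoting an outer join to an inner join
    when some null-rejecting criterion is attached to its table.

    Assumes the criteria are all ANDed together at the top level.
    """
    strict = {c.table for c in criteria if c.rejects_null}
    return [Join(j.table, j.on, j.outer and j.table not in strict)
            for j in joins]


joins = [
    Join("phone", "phone.person_id = person.id"),
    Join("address", "address.person_id = person.id"),
]
criteria = [
    Criterion("phone", "phone.number LIKE '555%'", rejects_null=True),
    Criterion("address", "address.city IS NULL", rejects_null=False),
]
result = innerfy(joins, criteria)
```

In this example the phone join becomes an inner join, because the LIKE test
can't be true for a null, while the address join stays outer, because
IS NULL is satisfied by exactly the null rows.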
Man, it's getting easier to see why databases are so expensive. We haven't
even *touched* the issues of physical-level optimization, concurrency,
updates, etc.
It also seems clear to me that, as I work up the libraries for this, I'm
going to need to have ways to "leave out" some features for now, so we can
get something usable, sooner. One of the easiest ways to do that would be
to focus first on implementing the relational algebra, and skip the
concept filter -> relational translation subsystem for the time being. But
note that relational algebra is like SQL, only uglier. :) Even with the
advantage provided by view wrappers, it's going to be very messy to use.
For my project at work, I still need to come up with a slightly more
compact notation for the conceptual query syntax, as we are planning to go
through an existing application and convert its queries to a conceptual
form. Sadly, it might be less effort to then manually convert those
queries to relational algebra form, than to write the automatic
conceptual->relational translator. Oh well, one thing at a time. I think
I've got relational algebra down pat now, at least for the subset needed to
implement conceptual queries, and adding aggregates to a relational algebra
framework is just another RV operator.
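
For a sense of scale, here's a minimal sketch of what such an algebra might
look like in Python, treating an RV as a plain list of dicts. That
representation and all of the function names are assumptions for
illustration, not a committed design. The point is only that aggregate()
has the same shape as the other operators: RV in, RV out.

```python
def select(rv, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rv if predicate(row)]


def project(rv, columns):
    """Keep only the named columns, dropping duplicate rows."""
    seen, out = set(), []
    for row in rv:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out


def theta_join(left, right, condition):
    """Pair up rows from two RVs wherever condition(l, r) is true."""
    return [{**l, **r} for l in left for r in right if condition(l, r)]


def aggregate(rv, group_by, **aggs):
    """Group rows on the group_by columns and compute one value per
    keyword argument; each value is a function of the group's row list."""
    groups = {}
    for row in rv:
        groups.setdefault(tuple(row[c] for c in group_by), []).append(row)
    return [
        {**dict(zip(group_by, key)),
         **{name: fn(rows) for name, fn in aggs.items()}}
        for key, rows in groups.items()
    ]


emp = [
    {"emp": "ann", "dept": "ops", "salary": 50},
    {"emp": "bob", "dept": "ops", "salary": 70},
    {"emp": "cy", "dept": "dev", "salary": 90},
]
dept = [{"dept_name": "ops", "floor": 1}, {"dept_name": "dev", "floor": 2}]

# Operators compose: join, then filter, then aggregate.
joined = theta_join(emp, dept, lambda e, d: e["dept"] == d["dept_name"])
first_floor = select(joined, lambda row: row["floor"] == 1)
totals = aggregate(emp, ["dept"],
                   total=lambda rows: sum(r["salary"] for r in rows))
```

As the note about ugliness suggests, even this toy version is clumsier to
write than the equivalent SQL.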
So, it's definitely looking like the relational algebra framework is going
to be the thing to code first, although a conceptual query notation is also
still important to decide on. (The latter doesn't have to be written as
valid Python expressions, though; we just need a working syntax.)
We can then work on algebra -> SQL/LDAP translation on the one hand, and
conceptual filter -> algebra translation on the other, to complete our
conceptual -> SQL/LDAP stack. This is definitely a major project in all,
and not nearly as "incremental" in nature as most work that's been done
with/on PEAK to date. Further, the overall risk is what I'd consider
rather scary, which means that it's probably up to a level that most
project managers would never even notice, but then I'm really paranoid. ;-)
But I think the risks should be mitigated quite a bit by starting with the
algebra (which is very straightforward to implement, and based on decades
of solid mathematics), and then moving to SQL query generation (which
absolutely has to work if the end result is going to be useful). So we
should be able to get feasibility feedback before too long. It'll just be
a longer time than I'm accustomed to for getting that kind of feedback.
Damn, it's almost 1 AM. I'd better get some sleep if I'm to have any
chance of getting to work tomorrow. As it is, I've been at home sick since
Tuesday, which means I've spent nearly all my waking hours working on this,
instead of being distracted by other things at the office, like all the
other work I should be doing. :)