[TransWarp] Towards a query theory, part 1: filters and correlation

Wed Oct 15 10:15:49 EDT 2003

At 12:41 AM 10/15/03 -0500, Ian Bicking wrote:
>Given this, maybe it's not that hard to support cross-database (or 
>non-database) queries and joins to those queries.  You identify all the 
>anonymous objects in the query.  Whenever one anonymous (or not anonymous, 
>I guess) object has to be compared with another object not in the same 
>database, you fetch both objects and execute the comparisons in 
>Python.  But that might be horribly inefficient, I'm not sure -- if you 
>lose the ability to index because of this, and replace it with linear 
>searches, you're in a bad place.  But I haven't thought about it much yet.

Well, assuming it's an equijoin, you'd load the "short" side into a hash 
table and then iterate over the "long" side to do the join.  The exception 
is when the "long" side is in SQL and the short side is short enough to 
dump its data into an IN() clause in the SQL, or perhaps to insert it into 
a temporary table.  That sort of decision is up to the DB driver.  I believe 
that Gadfly 2's greedy join optimization algorithm is sufficient to cause 
the short side to be executed first.  It would then be up to the long-side 
table to decide how to accomplish the join.  Really, the actual efficiency 
that results will depend mostly on how accurately each database estimates 
the "cost" (both in time and rows returned) of doing the join on its side.
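
To make the in-Python case concrete, here is a minimal sketch of the hash 
join described above.  The function name, the dict-per-row representation, 
and the key parameters are all assumptions made for illustration; none of 
this is existing PEAK or Gadfly code.

```python
from collections import defaultdict


def hash_join(short_rows, long_rows, short_key, long_key):
    """Equijoin two iterables of dict rows (hypothetical helper).

    The "short" side is loaded into a hash table once; the "long" side
    is then streamed past it, so the cost is roughly one pass over each.
    """
    # Build the hash table: join value -> list of short-side rows.
    index = defaultdict(list)
    for row in short_rows:
        index[row[short_key]].append(row)

    # Iterate over the long side and probe the table for matches.
    for row in long_rows:
        for match in index.get(row[long_key], ()):
            # Merge the two rows; long-side values win on name clashes.
            yield {**match, **row}


depts = [{"dept_id": 1, "dept": "R&D"}, {"dept_id": 2, "dept": "Ops"}]
emps = [
    {"name": "a", "dept_id": 1},
    {"name": "b", "dept_id": 3},
    {"name": "c", "dept_id": 1},
]
joined = list(hash_join(depts, emps, "dept_id", "dept_id"))
```

Here only the employees in department 1 survive the join.  When the long 
side lives in SQL, a driver could instead turn the short side's key values 
into an IN() list, as mentioned above.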

>>>SQLObject's basic metaphor is one of tightly encapsulated row/objects, 
>>>so we have to produce objects, not arbitrary tuples/relations.  I.e., we 
>>>have to find instances of Employee, not just cities, or a combination of 
>>>employees and their supervisors.
>>
>>Understood.  For purposes of conceptual simplification, I'm considering 
>>selecting an object to mean selecting the *primary key* of the object, 
>>not any of its attributes.  This allows me to translate to "standard" 
>>relational theory.
>
>If you don't consider updates (which in the span of one query is not 
>really necessary) you can think of the entire tuple as a unit with the 
>same effect.  But getting the whole row is basically an optimization, not 
>a fundamentally different thing.

Right.  In the current form of the theory, one could simply start with a 
context RV that already had all its columns marked as output parameters, 
and one could always have a '*' traversal operator that labelled all of 
the context RV's unlabelled columns as output parameters according to some 
naming convention.
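
As a rough illustration of what such a '*' operator might do, the sketch 
below models a context RV as a plain mapping from column name to output 
label, with None meaning "unlabelled".  That representation, the function 
name `star`, and the default naming convention (reuse the column name) are 
all assumptions for the example, not part of the theory as stated.

```python
def star(columns, naming=lambda col: col):
    """Label every unlabelled column as an output parameter.

    `columns` maps column name -> output label, or None if unlabelled.
    `naming` is the (hypothetical) naming convention for new labels.
    Columns that already carry a label are left untouched.
    """
    return {
        col: (label if label is not None else naming(col))
        for col, label in columns.items()
    }


employee_rv = {"id": "emp_id", "name": None, "city": None}
labelled = star(employee_rv)
prefixed = star(employee_rv, naming=lambda col: "emp_" + col)
```

With the default convention, `name` and `city` become output parameters 
under their own names, while the existing `emp_id` label is preserved.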