[TransWarp] Towards a query theory, part 1: filters and correlation
Phillip J. Eby
pje at telecommunity.com
Wed Oct 15 10:15:49 EDT 2003
At 12:41 AM 10/15/03 -0500, Ian Bicking wrote:
>Given this, maybe it's not that hard to support cross-database (or
>non-database) queries and joins to those queries. You identify all the
>anonymous objects in the query. Whenever one anonymous (or not anonymous,
>I guess) object has to be compared with another object not in the same
>database, you fetch both objects and execute the comparisons in
>Python. But that might be horribly inefficient, I'm not sure -- if you
>lose the ability to index because of this, and replace it with linear
>searches, you're in a bad place. But I haven't thought about it much yet.
Well, assuming it's an equijoin, you'd load the "short" side into a hash
table and then iterate over the "long" side to do the join. The exception
is when the "long" side is in SQL and the short side is short enough to
dump its data into an IN() clause in the SQL, or perhaps to insert it into
a temporary table. That sort of decision is up to the DB driver. I believe
that Gadfly 2's greedy join optimization algorithm is sufficient to cause
the short side to be executed first. It would then be up to the long-side
table to decide how to accomplish the join. Really, the actual efficiency
that results will depend mostly on how accurately each database estimates
the "cost" (both in time and rows returned) of doing the join on its side.
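To make the hash-table approach concrete, here is a rough sketch in plain Python. It is only an illustration of a generic in-memory hash equijoin, not code from PEAK, SQLObject, or Gadfly; the function name hash_equijoin, the dict-per-row representation, and the sample data are all assumptions made up for this example.

```python
from collections import defaultdict

def hash_equijoin(short_rows, long_rows, short_key, long_key):
    """Yield (short_row, long_row) pairs where the key columns are equal.

    Hypothetical helper: rows are assumed to be dicts, and both sides
    are assumed to be already fetched from their respective sources.
    """
    # Build the hash table from the "short" side only.
    index = defaultdict(list)
    for row in short_rows:
        index[row[short_key]].append(row)
    # Stream over the "long" side once, probing the table for matches.
    for row in long_rows:
        for match in index.get(row[long_key], ()):
            yield match, row

# Made-up sample data standing in for rows from two different databases.
departments = [
    {"dept_id": 1, "city": "Dallas"},
    {"dept_id": 2, "city": "Austin"},
]
employees = [
    {"name": "Ann", "dept_id": 1},
    {"name": "Bob", "dept_id": 2},
    {"name": "Cy", "dept_id": 1},
    {"name": "Di", "dept_id": 3},  # no matching department, so dropped
]

pairs = list(hash_equijoin(departments, employees, "dept_id", "dept_id"))
```

The point of the sketch is that each long-side row costs one hash lookup rather than a linear scan of the short side, so losing the database's index does not necessarily mean falling back to a nested linear search.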
>>>SQLObject's basic metaphor is one of tightly encapsulated row/objects,
>>>so we have to produce objects, not arbitrary tuples/relations. I.e., we
>>>have to find instances of Employee, not just cities, or a combination of
>>>employees and their supervisors.
>>
>>Understood. For purposes of conceptual simplification, I'm considering
>>selecting an object to mean selecting the *primary key* of the object,
>>not any of its attributes. This allows me to translate to "standard"
>>relational theory.
>
>If you don't consider updates (which in the span of one query is not
>really necessary) you can think of the entire tuple as a unit with the
>same effect. But getting the whole row is basically an optimization, not
>a fundamentally different thing.
Right. In the current form of the theory, one could simply start with a
context RV that already had all its columns marked as output parameters,
and one could always have a '*' traversal operator that labelled all of
the context RV's unlabelled columns as output parameters according to some
naming convention.