[04:03:36] [connected at Tue Nov 11 04:03:36 2003] [04:03:36] <> *** Looking up your hostname... [04:03:36] <> *** Checking ident [04:03:37] <> *** Found your hostname [04:04:09] <> *** No identd (auth) response [04:04:09] <> *** Your host is zelazny.freenode.net[irc.oregonstate.edu/6667], running version dancer-ircd-1.0.32 [04:04:09] [I have joined #peak] [04:04:09] ** zelazny.freenode.net set the topic to PEAK http://peak.telecommunity.com || WikiPages at http://peak.telecommunity.com/DevCenter || IRC-Logs: http://peak.telecommunity.com/irc-logs/ [04:11:03] jack-e|away is now known as jack-e [04:11:05] * jack-e is back [13:01:49] ** Maniac has joined us [13:01:49] ** SteveA|home has joined us [13:25:45] ** lex has joined us [14:32:09] ** pje has joined us [14:32:20] hi phillip [14:32:24] Hey. [14:38:56] * pje is still playing with metaphors for conceptual queries [14:39:27] I'm trying to find a unifying interface for what expression nodes need to do. [14:39:50] * jack-e has not yet played with peak.query at all [14:40:47] Well, there's not much you can *do* with it yet, except make bad SQL. :) [14:40:54] hehe [14:41:01] bad = non-portable, as the per-backend machinery isn't there yet. [14:41:17] i.e., the SQL generation is just proof-of-concept so far. [14:41:43] ok [14:43:45] i've read your mail about o-r mapping .. do you plan the storage-refactoring to happen before the next alpha ?? [14:43:53] * pje shrugs [14:44:05] * jack-e does not hope so :) [14:44:06] My original schedule for the alphas is in a shambles, at this point. [14:44:26] ok [14:44:33] "Real" work often forces priority changes. [14:44:45] yes [14:45:03] I'm hoping soon to be able to finish doc for a PyProtocols 0.9.1 release, and then maybe start on wrapping up PEAK alpha 3. [14:45:31] But I'd also really like to start a PEAK developer's guide, similar to the PyProtocols manual. [14:45:45] i.e., something that contains theory, examples, and reference. [14:46:01] that would be really helpful for all the interested people [14:47:00] Yep. [14:47:28] But it's a daunting task. The source of PyProtocols' manual is more than three times the number of lines as PyProtocols itself! [14:47:40] wow [14:47:44] If that ratio holds true for PEAK, my hands are going to be very very sore. :) [14:47:51] :) [14:48:20] Oh, and I don't put near as much paging whitespace in my manuals source lines... [14:48:31] And documentation text lines are longer than code lines. [14:48:51] Obviously, I write code with extremely high semantic density. :) [14:49:03] ;-) [14:49:16] And, I sometimes feel that I didn't write as much about PyProtocols as I could/should have. :) [15:11:43] * pje thinks that concept query expressions can be described in terms of type and labelled variables. [15:12:06] "Type" meaning the type of the values yielded by that expression. [15:12:37] And variables being like columns of the result of that query. [15:13:10] An expression that yields no variables is either a single-column query, or a subquery that can be used to restrict a containing query. [15:13:42] An AND node yields the most-specific subtype of its subexpressions, along with all their variables [15:14:07] (but gets an error if any of the expressions' types is not a supertype of the most-specific type) [15:14:49] An OR node yields the most-specific common *ancestor* of the types of its subexpressions [15:15:01] (and gets an error if the expressions have no common supertype) [15:15:20] NOT yields the same type as its subexpression. [15:15:57] FEATURE nodes yield the type of their containing type. [15:16:09] That is, the feature 'Employee.name' yields type Employee. [15:16:39] Comparison nodes yield the type of their nearest containing FEATURE node. [15:17:21] So in 'FEATURE(Employee.name,EQ("Bob"))', the EQ node yields the type of Employee.name (presumably 'str') [15:17:53] Or more likely, model.String, I suppose. [15:18:48] Thus, 'FEATURE(Employee.name,EQ("Bob"))' is an expression yielding the set of employees whose name is Bob. [15:19:05] (Or technically, the set of employee #'s) [15:20:12] And the expression 'AND(FEATURE(Employee.name,EQ("Bob")), FEATURE(Employee.drives, FEATURE(Car.color, EQ("red"))))' [15:20:28] Is the set of employees named Bob who drive red cars. [15:22:05] Several trivial conversions from this to SQL are possible, but they're not very useful. [15:22:40] A literal translation would use UNION and INTERSECT a lot. [15:23:48] Basically, conceptual queries are equivalent to the "domain relational calculus", which doesn't map well to SQL. [15:24:25] Hm. I think I've missed a few points about how the other expression nodes should behave, though. [15:24:49] An AND should yield any named variable exposed by any of its subexprs. [15:25:37] An OR node is similar, but only named variables exposed by *all* of its subexprs are guaranteed to be non-null. [15:26:19] And exposing variables that occur on fewer than all its subexprs requires an outer join in the SQL implementation. [15:26:59] NOT cannot expose any variables, since matching rows don't appear in the output. [15:27:29] FEATURE exposes the variables exposed by its subexpression, plus itself if it's tagged with an output variable name. [15:28:12] so FEATURE(Foo.bar,someExpr,name='fooVar') exposes the variable 'fooVar', of type Foo, in addition to any variables exposed by 'someExpr'. [15:29:47] And comparison nodes don't expose anything. [15:31:06] Actually, I suppose they should also be considered not to yield anything, either. [15:31:40] Otherwise, 'GT(42)' is implicitly a query representing all integers greater than 42. :) [15:32:28] Hm. I think that's now a very nicely consistent set of semantics for conceptual query expression nodes. [15:33:58] That is, as long as we ignore for now the MAYBE() and EXISTS() operators. [15:36:00] It also occurs to me that we need to distinguish between final output variables, and intermediate output variables. [15:36:15] The latter might be used for correlation, but not output as a result of the whole query. [15:37:01] So, if variables are also tagged for whether they are "shown" in the query output, then a NOT node should raise an error if its subexpression offers any such variables. [15:38:16] And an OR node need not establish outer joins for unshown variables; indeed an unshown variable should not be exposed outside an OR node where it's defined. [15:38:59] So far, all of these characteristics are "upward" in nature: a node can discover them from its immediate subnodes. [15:39:52] Which is very clean from a functional-programming POV, and makes query fragments sharable/reusable between queries. [15:44:58] Any node without "shown" variables may be represented as an IN or EXISTS subquery in SQL. [15:46:35] Or as a standalone query using joins, as long as you don't care about repeated values in the output. [15:47:28] So this is probably the best kind of expression to consider for how the bare-bones conceptual->relational translation should take place. [15:47:54] In its simplest form, we consider every FEATURE node to represent a table to be joined. [15:48:40] Walking down the tree, we find each FEATURE node, add it as a table, and keep track of the columns in it that will be used to join to its children. [15:49:03] (Or, if we encounter a comparison, to apply the comparison to.) [15:50:24] As we come back up, we AND, OR, and NOT the comparisons together, to create a 'where' restriction for the resulting join structure. [15:50:55] (Thus, we ignore boolean operators as we go down, and ignore features on the way back up.) [15:53:11] At the topmost level of the query, if there are multiple root FEATURE nodes, we must use UNION, INTERSECT, and EXCEPT to merge them. [15:53:34] Now there are several problems to this basic approach: [15:54:02] * Many FEATUREs may use values from the same table, but they'll be joined in more than once. [15:54:29] * We have to use "set" operators at the top level [15:55:09] * Join structures under OR and NOT may not be properly constructed [15:55:35] I think that maybe all of these problems are actually the same problem, in different guises. [15:56:09] Specifically, that a FEATURE node sometimes means the need for a new table, and sometimes it does not. [15:56:51] If a FEATURE calls for a new table, then an OR or NOT containing that feature can't be joined to the parent table in the naive way. [15:57:08] If a FEATURE doesn't call for a new table, there's no need for repeated joins. [15:58:25] And, if the top level features all deal with the same table, then set operators are not needed. [15:58:53] What's not clear to me yet is, what does "call for a new table" really mean? [15:59:28] As a human with a brain (mostly), I can look at a query and *know* whether the table is needed. [15:59:42] But, there is nothing in a feature (by itself) that indicates this. It's contextual. [16:00:33] So, we need to look at a much simpler query structure, like the old FEATURE(Employee.name,EQ("Bob")) query. [16:00:55] Here, it's trivially convertible to SELECT empnr FROM Employee WHERE empname="Bob" [16:01:26] Or more precisely, Employee(where=Employee.empname.eq("Bob")), in today's peak.query lingo. [16:02:40] If we want to AND this with FEATURE(Employee.Branch,EQ(42)) (i.e., employees whose branch is #42), what do we do? [16:02:43] jack-e is now known as jack-e|away [16:02:47] nite [16:02:53] Adios! [16:02:57] have fun :) [16:04:32] Clearly, we want to end up with Employee(where=Employee.empname.eq("Bob") & Employee.branchnr.eq(42)) [16:05:48] In the case where each of the constituent queries is against a single, identical relvar, we simply perform the boolean AND or OR of the conditions. [16:08:03] (And keep the relvar, of course.) [16:08:51] Now let's try it with similar, but unequal, relvars... [16:09:14] FEATURE(Employee.Branch,FEATURE(Branch.city,EQ("Paris"))) [16:09:26] AND-ed with our Bob query. [16:10:21] Employee.branch and Employee.name actually come from the same table, so they can be combined, but the FEATURE(Branch.city) subtree cannot. [16:11:41] However, if we AND in another condition about the branch, found in the branch table, then we could combine that with the other branch condition. [16:14:59] Hmmm. I had thought that by keeping the cases very simple here, that I could avoid the issue of one-to-one versus one-to-many. [16:15:20] However, that's not so, even for something as simple as this. [16:15:59] Consider an expression such as 'AND( F(x,...), F(x,...) )' [16:16:47] We cannot combine those FEATURE subexpressions in any way, unless we know that feature 'x' is one-to-one. [16:17:22] Otherwise, this might be a query that requires there to exist two different 'x', each meeting a different criterion. [16:18:43] Really, we also need to know whether a feature might be one-to-zero. [16:19:18] (i.e., if it's optional) [16:19:46] Hm. [16:22:08] Okay, let's abandon trying to map directly to a join structure. [16:23:19] Instead, let's consider each node as a specialized sort of relvar, that knows its "key columns", as well as its output columns. [16:23:36] Thus, a FEATURE node containing only conditions maps trivially to such an extended relvar. [16:24:29] Now, looking at it from a relational algebra point of view, suppose we implement UNION, INTERSECT, and EXCEPT operators. [16:25:17] But we allow them to produce a result that combines tables where appropriate. [16:26:39] To do this, we'd have to treat all lower level FEATURE nodes as 'IN (SELECT ...)' subqueries, and there would be no joins involved. [16:27:21] There seem to be two optimizations, then, that we want to be able to do: [16:27:48] 1) convert a UNION or INTERSECT of two identical relvars, to conditions expressed against a single relvar [16:28:26] 2) "lift" subqueries into joins, if it doesn't affect the semantics. [16:30:19] We can perform optimization 1 whenever the "key" of the combined relvars is unique, since the conditions must then always apply to the same row. [16:31:50] We can perform optimization 2 if the child query's "key" is unique, *and* the subquery is one of the top-level conjuncts of the current query. [16:33:18] If there are multiple IN subqueries that apply to the same column(s) in the top-level query, we can attempt to union or intersect *those*. [16:37:59] Intersection would be straightforward - scan the conjuncts, find subqueries on common items, attempt to intersect. [16:38:33] Then scan the disjuncts, and union them, replacing the disjunct. And so on. [16:38:51] Hm. Actually, it's simple in principle, but complex to do. [16:41:17] Okay, forget using augmented relvars. Instead, we'll use "feature subtrees" paired with conditions. [16:41:52] A "feature subtree" is just a tree of feature nodes, by themselves, with no booleans involved. [16:42:43] Then, we define unions and intersects on these, remapping the conditions to point to different features as appropriate. [16:43:18] Actually, I guess they're really relvar trees, sort of. But they're not joins, and we don't convert them to joins until we get to the top of the query. [16:45:01] These relvar trees consist of either: [16:45:07] 1. a relvar node with children hanging off its output columns, or [16:45:34] 2. A UNION, INTERSECT, or INVERSE node. [16:45:54] UNION and INTERSECT have N subtrees, and INVERSE has one subtree. [16:46:20] Now we can merge two relvar trees... [16:47:56] Ignoring INVERSE and not simplifying for symmetry, there are eighteen possible combinations. [16:48:37] Simplifying for symmetry, there's twelve. [16:49:23] Starting with the simplest case, of two trees that root in a relvar, either they are the same relvar+key, or not. [16:50:28] If they are, and the key is unique, we can combine the nodes, whether we are performing a union or intersect. [16:50:57] We then recurse to perform the same operation between the named children of each original relvar. [16:51:58] If the two relvars are not combinable, then we place them under an INTERSECT or UNION node, as appropriate. [16:53:02] ** Maniac has joined us [16:53:18] Heh, for a second there I thought I'd finally turned you into your namesake. ;) [16:54:06] If we want to intersect a relvar with an INTERSECT, or union it with a UNION, we loop over all the multi-item node's relvar children, seeking a suitable pairing. [16:54:54] If one is found, then we proceed as before, otherwise we hang the new relvar off the same combining node. [16:55:33] If we're unioning an INTERSECT or intersecting a UNION, it's trickier. [16:55:54] intersecting a UNION means we are intersecting each element of it. [16:56:49] So, we could loop through them and intersect them all. But I don't think that does anything positive. [16:57:17] After all, we are guaranteed not to combine with more than one of the child nodes. [16:58:24] Likewise unioning an INTERSECT doesn't seem to do anything helpful, either. Both just seem an invitation to add more complexity to the tree. [16:59:24] I think that covers all the cases that involve a single relvar on one or both sides. [16:59:57] Leaving only six other possibilities. :) [17:00:27] We'll quickly dispose of INTERSECT intersecting INTERSECT and UNION unioning UNION... [17:01:26] In each case, we loop over one side's children and try to combine them with the other side's, then merge the remainders. [17:03:02] Now what about I-u-I and U-i-U? Again, messing with these seems like an invitation to complexity blowup.. [17:03:59] So we'll leave those alone. [17:04:27] Okay, only two combinations left: I-i-U and U-u-I [17:05:34] In each case we should probably just add the interloper under the appropriate node type. [17:06:10] Now let's see if we can turn these operations into a simple polymorphic algorithm... [17:06:35] We'll say a node has a 'combinable(otherNode)' method that indicates whether it can be combined with another node. [17:07:10] I and U nodes are never combinable; their methods return False. [17:07:27] And rv's method returns false if passed an I or U node, or an incompatible rv. [17:08:00] Nodes also have disjuncts() and conjuncts() methods, similar to logical expressions in p.q.a. [17:09:05] So, an I node returns its contents for conjuncts and itself for disjuncts, and vice versa for U nodes. [17:09:14] RV's return themselves for either method. [17:10:10] Now, to perform an intersect between two items, we loop over the each item's conjuncts, matching up combinables. [17:10:33] If we end up with only one node, we output it, otherwise we return an I node over the nodes we ended up with. [17:10:53] And for a union, we loop over both items' disjuncts instead of conjuncts. [17:11:55] And the I/U node constructors will "flatten" any nested I's or U's the way p.q.a AND and OR nodes do. [17:14:05] As we combine nodes, we build/update a translation table mapping the original uncombined RV's to the new ones. [17:15:20] Thus, we know what the end conditions apply to. [17:17:36] Hm, rounding out our previous node type sets, I guess I should've said that an INVERSE node may combine with another node if its child is combinable with the other node. [17:18:53] Or is that really true? [17:19:07] For a single subnode, it is. [17:19:38] But what about nested subnodes? [17:19:58] This needs some more thought, especially in how conditions would be applied. [17:21:11] If you do a NOT over a set of conditions that apply to a single relvar, then you can simply NOT the conditions, without putting an INVERSE node over the relvar. [17:25:01] * pje wonders idly if he should create a peak-design channel for these ramblings. :) [17:26:24] But conditions that apply to multiple relvars can't simply be NOT-ed. [17:29:55] So, when you 'invert' a relvar tree, the relvar either inverts its conditions if it has no children, or returns an INVERSE node encompassing itself. [17:30:01] So INVERSE nodes are not combinable. [17:30:24] Except with another INVERSE node, that is. [17:31:29] INV(a)-u-INV(b) or INV(a)-i-INV(b) should result in INV(a-i-b) or INV(a-u-b). [17:32:12] Alrighty, now we're cooking...! [17:34:16] Now all we have to do is describe how to go from a relvar tree to a p.q.a query. [17:36:20] It's interesting to note, by the way, that this join-combination algorithm is entirely bottom-up. [17:39:12] I think I need to try it on a couple of test cases to see what kind of tree we end up with... [17:39:59] Let's look at "employees who drive a red car or speak french"... [17:41:49] OR(FEATURE(Employee.drives,FEATURE(Car.color,EQ("red"))), FEATURE(Employee.speaks,EQ("French"))) [17:42:07] * pje begins scribbling a node diagram for the relvar tree [17:44:00] Hm, not interesting enough. Let's try a couple different ways of specifying "drives a red car or drives a fiat"... [17:44:43] FEATURE(Employee.drives,OR(FEATURE(Car.model,EQ("Fiat")),FEATURE(Car.color,EQ("red")))) [17:44:45] versus [17:45:47] OR(FEATURE(Employee.drives,FEATURE(Car.color,EQ("red"))), FEATURE(Employee.drives,FEATURE(Car.model,EQ("Fiat")))) [17:48:09] Hm. This shows that combinability of relvars is a bit more complex than I thought. [17:49:53] ** pje has joined us [17:57:11] And I haven't even looked at the "optimization 2" issues yet. :( [17:58:58] (i.e. promoting 'IN (SELECT ...)' to a join if the nested query has shown variables, or is a top-level conjunct and is not one-to-many. [18:00:11] ** lex_ has joined us [18:00:40] This is effectively the job of a FEATURE node on the upward route; it either applies its subquery as a condition on the relvar it represents, or else joins the subquery to the relvar. [18:05:47] On the downward route, the FEATURE node passes down a CV to be used by comparison operators. [18:06:06] On the upward route, AND and OR operations perform unions and intersects. [18:06:13] And NOT performs inversion. [18:07:26] When a feature receives an inverted subquery on the upward path, it converts it to a NOT IN condition. [18:08:02] If it receives a union or intersect node, it converts to AND/OR x IN conditions. [18:09:11] * pje sighs, nearing exhaustion for this session. [18:10:06] At least, this is strong forward progress. [18:13:23] One interesting thing is the dual nature of subqueries and conditions. [18:13:49] In effect, the child of every feature can be viewed either as a condition or a subquery. [18:14:13] Simultaneously! [18:14:55] i.e. it's both a query on its own and "parentFeature_outColumn IN (subquery)" [18:18:22] More precisely... a FEATURE node is both. A comparison is only a condition. A boolean construct is a condition. [18:19:00] Unless the boolean combines a pair of combinable queries. [18:19:12] In which case it becomes a query. [18:19:41] (Sounds like possibly a job for adaptation...) [18:20:27] So, a FEATURE node adapts its child to a condition, and returns a query (that can be adapted to a condition). [18:21:34] A boolean tries to combine any of its children that are queries, but has to leave conditions alone. [18:22:12] A boolean can be treated as a query only if it has queries as children. [18:23:19] (If it has any condition-only children, then the query structure is missing a parent FEATURE that the condition(s) apply to. [18:38:47] Hm. Actually, a FEATURE node probably wants to view its child as a query first, in order to attempt to join it, if possible. [18:41:09] Hmm... no, that's still not right. [18:41:22] It wants a condition, period. [18:41:46] But, if any conjuncts of that condition express joinable queries, it can pull them out and join them. [18:43:22] So... [18:43:23] FEATURE(x,condition) -> query [18:43:39] AND(q/c,...) -> condition [18:43:48] OR(q/c, ...) -> condition [18:43:56] NOT(q/c, ...) -> condition [18:44:03] Oops... [18:44:08] NOT(q/c) -> condition [18:44:53] adapt(query,Condition) -> cv.isIn(query) (SQL: "cv IN (SELECT ...)") [18:45:21] Cond(op,valueOrParam) -> condition [18:46:20] I guess we could say all nodes have to have an 'asConditionOn(cv)' method, which is what FEATURE() nodes would call on their child. [18:46:53] This would allow the CV to be passed down to pure-condition nodes. [18:48:15] And booleans would pass it down to their nodes, so that if they're over a query, the query returns cv.isIn(query), and if they're over a condition, they get Cond(cv,op,valOrParam). [18:48:35] So, the boolean would then return a new boolean condition formed over its constituents. [18:49:16] Of course, the boolean's constructor would've already handled combining any combinable children. [18:50:07] And the feature's constructor would call child.asConditionOn(CV) for its newly created CV. [18:50:33] Hm. Actually, FEATURE could simply be a function, because there doesn't need to be a feature object at all. [18:51:10] FEATURE(x,cond) is simply a transformation from a condition to a query, using the schema mapping info. [18:53:07] It clones the schema-specified relvar, takes the appropriate columns and passes them into child.asConditionOn(), to get a where= condition to apply to the relvar. [18:54:26] It then looks through the condition's conjuncts to find individual conditions that can be promoted to joins. [18:55:16] Excellent. I think I'm ready to wrap it up for tonight. [18:56:22] Open issues still exist for both kinds of optimization (subquery lifting and relvar combining), but I think these are now practical matters for coding, not items requiring new theory. [18:57:55] Most important, it's now largely clear which nodes have responsibility for what. [18:59:22] Hmmm... FEATURE has to be an object, after all, since it has to be able to get at a schema mapper; something to be supplied in a downward call. [19:00:20] So you'll have to call e.g. featureNode.asQuery(mapper). [19:00:59] .asConditionOn() doesn't need a mapper, since you only call it on queries, not features. [19:01:41] Whoops. That's wrong. They both need the parameter. [19:01:49] Ah well. [19:02:17] * pje yawns [19:05:28] Interesting; there's no need for UNION/INTERSECT nodes any more. AND/OR do the same thing now that we separate conditions and queries. [19:06:18] But, they might pop back up once we consider output variables, outer joins, and suchlike. [19:09:20] Or maybe features will handle that when they elevate subqueries to joins. [19:09:27] "Tune in tomorrow, to find out!" :) [19:10:46] * pje waves [19:10:55] ** pje has left us [20:57:02] ** _Maniac has joined us [23:58:56] ** _Maniac has joined us [23:58:56] ** SteveA|home has joined us [23:58:56] ** jack-e|away has joined us