Developing Linq to LLBLGen Pro, Day 5
(This is part of an on-going series of articles, started here)
Consuming Expression trees, back to Special Case programming?
I'll show you 5 different queries and what their expression tree looks like in text (tested on Linq to Sql, so you'll see Table references)
Query A:
// C# var q = from c in nw.Customers select new { Name = c.ContactName, Orders = from o in nw.Orders where o.CustomerID == c.CustomerID select o };gives:
// Expression tree .ToString() output Table(Customer).Select(c => new <>f__AnonymousType0`2(Name = c.ContactName, Orders = value(WindowsFormsApplication1.Program+<>c__DisplayClass0).nw.Orders.Where( o => (o.CustomerID = c.CustomerID))))
In case you wonder... the WindowsFormsApplication1.Program namespace is the test app namespace I used to run the query with. As you can see, a bug pops up: instead of referring to Table(Order), it refers via the application to the context.Orders for the Where clause. This will be a pain to correct, I'm sure. Also there's just 1 Select call, while there are two in the query. This will be a pain too, but more on that below.
Query B:
// C# var q = from c in nw.Customers let x = c.Region ?? "Undef" where (c.Country != "Germany") && x != "Undef" select new { CompName = c.CompanyName };gives:
// Expression tree .ToString() output Table(Customer).Select(c => new <>f__AnonymousType0`2(c = c, x = (c.Region ?? "Undef"))).Where( <>h__TransparentIdentifier0 => ((<>h__TransparentIdentifier0.c.Country != "Germany") && (<>h__TransparentIdentifier0.x != "Undef"))).Select(<>h__TransparentIdentifier0 => new <>f__AnonymousType1`1(CompName = <>h__TransparentIdentifier0.c.CompanyName))Yeah, I also wondered about the transparent identifier, but it's the reference to a derived table-like sub result if you look closely. See also the C# 3.0 spec, where this is explained in detail.
Query C:
// C# var q = from c in nw.Customers where c.Country != "USA" select new { CompName = c.CompanyName };gives:
// Expression tree .ToString() output Table(Customer).Where(c => (c.Country != "USA")).Select(c => new <>f__AnonymousType0`1(CompName = c.CompanyName))
Query D:
// C# var q = from c in nw.Customers select c;gives:
// Expression tree .ToString() output Table(Customer).Select(c => c)
Query E:
// C# var q = from c in nw.Customers where c.Country=="USA" select c;gives:
// Expression tree .ToString() output Table(Customer).Where(c => (c.Country = "USA"))
Confused? So am I. Yesterday I spend some time on writing code for the execution of queries, setting up the proper LLBLGen Pro elements such as live transactions and the like so they're usable with the query objects so deferred execution still would take place inside a transaction if required (so SqlServer won't deadlock on you), and I also spend some time analyzing what to do with the Expression tree nodes. Yesterday, it occured to me that it was something like LR(1) parsing: you handle an element with the lookahead info and place the results on the stack for the caller to process further. If the caller sees it has enough info to reduce the elements on the stack, it reduces all those elements and handles it further, placing the result of the reduction onto the stack again.
However, to make that work, the input has to be deterministic, i.e.: you shouldn't wonder in state N in what scope you're in, started in state N-M. Now take a look at queries D and E. D has a Select call, E doesn't. Why on earth doesn't E have a select call? This is undeterministic. What it should have been is:
// What should have been E's correct structure Table(Customer).Where(c => (c.Country = "USA")).Select(c=>c)
The expression tree created by the compiler is wrong, because the expression tree parser in the Linq provider has to decide what to leave out, however the compiler decides that for me now. The sad effect of this is that I now have to add code which has to guess what the intention was: when no Select call is present, apparently one still wants to do a select, with the native projection.
This is doable, although it's a form of 'special case programming': as a provider developer you now have to anticipate on a special case: namely the one where there's no Select call in the expression tree despite the fact that there is a select statement in the query (there always is). With the occurance of a special case, one has to wonder: what kind of other special cases are there, which are unknown to me now? And also: if the compiler tries to be clever, why isn't it more clever so parsing the nodes is easier and straight forward?
The main pain with writing the Linq provider isn't the code, it's the unknown voodoo which takes place inside the compiler which makes it emit calls to Queryable(Of T) extension methods which build apparently odd expression trees at runtime. To be able to write a proper Linq provider, you have to understand the intention of the query. There's no other thing to look for. You're handed an expression tree. So by parsing that tree, you try to re-build the query the developer wrote in C# (of VB.NET) which lead to this expression tree. If you solely look at elements inside the tree, you'll run into problems with things which have effect on various elements of the tree, like a join into with DefaultIfEmpty() calls to signal a left join (yes, Linq's join syntax is that cumbersome). If you know the intention, you can produce the proper query code on all databases you support via your provider and also you can optimize the query when necessary.
It's now unclear what lead to which expression tree subtree, so in other words: it's impossible to deterministically determine what the meaning is of a subtree in the expression tree, at least at this point where documentation about why the expression tree looks like the way it does is non-existent.
It's not hard to write code which can handle trees like the ones resulting from queries C, D and E. The main issues arise when someone introduces a line with 'let' like in query B or a second select (projection) in the main projection as in query A.
Take a closer look at the differences between query A and B. You'll notice that although query A has two select statements in the query, it has just one Select method call in the expression tree, while in query B, it has two Select method calls while there's just 1 select statement in the query.
Query B, with its two Select method calls gives another problem: what to do with the results? Does it have to be projected onto new types, or onto entities? Projecting onto entities is different, as shorter paths exist inside the O/R mapper core, simply because one doesn't set entity values through properties when you're fetching (to avoid triggering validation logic which shouldn't trigger at fetch time for example). However, with two (or potentially more) Select method calls exist, which one decides what is done with the end-result? Again, special case programming and not deterministic code: to handle a Select method call, one has to know the scope it is in: it can't decide based on its own parameters and the lookahead what to do: which one is the end-projection? The last one? But do you know that when you run into the first one? No, so you have to seek through the whole tree for another select. If that's not found, you know you're the only one. So doing things based on what you see in node X isn't going to work.
In LR(n) parsers, the stack top is important too. So this could help in the Select method handler to determine if it should decide the end projection or if it should reduce to a derived table. But this could lead to a shift/reduce conflict , which one normally solves by either extending the lookahead or changing the grammar (no can do) or by adding a special case (not recommended).
I'm sorry if I sound a litle bitter, but this whole Linq expression tree parsing system is overly complicated while this isn't necessary. I wrote my own LR(n) parser engine and action/goto generators, I know what parsing trees means, but with Linq expression trees it's like with a box of chocolates, isn't it, Forest...
As I said earlier, what you want with code like this is to be able to proof that the code is correct. As there are 'special cases' popping up left and right, being able to proof something is correct is impossible, unless you have deep knowledge of the inner workings and know why the tree looks like the way it does. Without that knowledge, I can't optimize query A. Query A can be done with 2 queries in generic code (or one, if you cheat, but that leads to special code for single path graphs). However to determine this situation is seen inside the query is hard. Sure, you can look at the expression tree of query A and take a guess that if you see something like that it will be optimizable.
However, Linq to Sql (and also the entity framework) will execute a myriad of queries with query A (number of customers + 1). I'll try to find a way to optimize it, but that depends on whether I can recognize the subquery/subqueries in a normal fashion. This is harder than you might think, as the subquery might or might not have a Select method call (and Microsoft, with knowledge of the inner workings of the compiler didn't know how to optimize it)
Yes, I do wonder if it wouldn't be simpler if I just got the linq query as text and did the parsing myself. Anyway, I'll start simple, with queries D, E and then C. I've the whole mechanism in place now so I can now add dispatch routines to my Expression node dispatcher for the various expression node types within the range of BinaryExpression for example, so I'll move forward with my LR(1) (so 1 lookahead) lookalike engine, which is hopefully enough to handle the cases it has to fight with. It's likely I need tree traversal code for cheating in situations where the 1-lookahead isn't enough. This is cumbersome though and makes things fragile.