Developing Linq to LLBLGen Pro, part 7
(This is part of an on-going series of articles, started here)
Last time I talked about the switch to the approach where most Queryable extension methods should be seen as sets on their own. What happened after that? Well, initially, I continued on the path I had taken a few weeks ago: a stack based term evaluator which was used by routines which handled the bigger terms. I created a couple of classes which were usable as a set-tree, but a single-path tree: a Where would modify the top of the tree, a Select would wrap the top of the tree with a new Set, using the original top of the tree as its source.
This approach went fairly well, I could write several unit tests which could fetch entity data, with or without filters on fields in these entities and create single-field projections, using LLBLGen Pro's own projection framework, though it all stayed a little simple. Doing the simple things first isn't a bad approach per se, after all, the complete scope of expression trees and what to convert into what is pretty complex and to grasp it in full takes a lot of time. Software Engineering, as its name implies, is a form of engineering so not all variables are known: you have to do experiments, work with the elements you do know and work from there towards a goal, solving the problems along the way. Have to create a bridge across a big stream but you don't know the exact strength of iron? Do experiments to find out. In Software Engineering things might look totally different but actually they're not.
With implementing a full Linq provider it's the same thing: there are many, many fine details inside the tree which can either be ignored or be very important to you, depending on the goal you want to reach, thus which provider you want to create. This means that if you write a provider which is actually a translator between two query languages, like I'm doing, you have a different set of problems than when you write a provider which has to generate SQL.
I hit a wall, when I wanted to add Joins to the code I had. The problem was that I tried to convert way too many elements at once into new objects. I also made the mistake that the single path tree wasn't really helpful because the Join expressions could contain select, where etc. so in a way it was a multi-branched tree. However, if you try to combine a lot of elements at once, this problem doesn't exist, after all handling the Join would mean handling the inner Select and Where, if they were present. However, to do that in practise, you have to do so many things at once, it's too complex.
During my engineering efforts on this code, I learned a lot about the Expression trees and how not to approach them, as you can read in these articles. However, I think these mistakes are unavoidable simply because it's unknown territory: what do handle first? How many elements in the Expression tree should you convert into new elements in one pass? Questions which are only answerable if you have seen the different approaches fail and succeed. This is in general true for many software engineering tasks. Often we're faced with a problem which is far too complex to understand in full from the beginning, we have to take smaller steps and solve the puzzle by solving smaller puzzles instead. However, the big question: which smaller steps to take and what's the best size per step is often a question which looks easy but is actually the hardest and also the most important part of the whole project.
To be able to handle the multi-branched tree and also to avoid converting way too much elements at once, I decided to modify the Expression tree in-place, exactly how Matt Warren explained in his series about writing a Linq provider. Matt uses a single pass routine (2 passes actually), and in the end it might be that a single pass or 2 passes is all that it takes, but when you're writing the thing and taking one step at a time, it's best to create a lot more passes, so a lot more handlers, which all change one thing in the tree, so instead of 1 big step in 1 pass, do a lot of smaller steps in a lot more passes.
Finding the right order
The key point is to find the right order. In general this isn't that hard, but if you have the wrong order, or if you're trying to handle too many nodes at once, you probably will run into problems later on, for example if you have already recognized entity field member access points, and converted them into field expressions, but you didn't assign any aliases yet to the tree sources. It's then hard(er) to find back to which alias the containing entity of the field accessed belongs to.
Rule of thumb is that in pass n+1, all nodes required for handling the nodes in that pass have to be there. For example, if you are going to replace BinaryExpression nodes containing a boolean operator (equal, lesser than etc.) with predicate expressions, it's best if you have already found entity field access nodes, recognized them and converted them into new expression nodes. This way, you can easier form predicates when handling BinaryExpression nodes because you can handle the left and right side and check what the handle action of these two sides resulted in, e.g. a predicate expression, a field expression etc. Though it's not tic-tac-toe: an expression like 'field < 10' can also be inside a projection or field expression inside a predicate. So it's best to do this in 2 passes: first the field expressions, then the predicates. This is an example how taking two passes (which are combinable perhaps later on) to solve the problem is not a hard puzzle: however to solve it directly in 1 pass might be a big challenge.
At the moment, my general order is to first traverse the tree and convert any Join method call expressions into JoinExpression objects, and assign aliases to the left and right side, plus to build some lookup tables in an Alias storage with all kinds of key entry points, so the rest of the tree handling can always find back the alias source to refer to based on the info known at that point. LLBLGen Pro only needs aliases if there are multiple times the same entity joined into a join set, so only when a join is present (or access to members which represent related sets, but more on that in a later article), aliases are necessary. Ironically, the last pass to process is the pass which actually handles JoinExpressions and converts them into relation objects for our API (together with the flattening of SelectExpression nodes, which are nodes which result in a set, with or without a projection, e.g. Where method calls, Select method calls etc.)
The reason why JoinExpressions are handled last is that by handling them the whole query is actually handled but doing it earlier, e.g. placing the handling of JoinExpressions earlier in the pass-stack, will run into a problem where the join is between an entity set and for example the result of a full other query, with a projection, where and for example also a join. It can be the JoinExpression handler isn't the last pass in the final result, however for the queries currently understood by the handler system, it is. Because everything is chopped up into smaller handlers, it's not a big deal to place a handler somewhere else in the stack, as long as the dependencies allow it.
To cut the process up into tiny fragments, you at first do a bit more work: you'll write a lot of handler classes which sometimes don't process much data. However, that's ok, you can always combine passes later on. To have the smaller, single item focussed passes, you can schedule them easily and if you start handling more complex queries, you can add more passes between other passes without having to worry you have to refactor a lot.
The last good point is also the best: it's provable code. If you have ran pass n, you know for sure the tree consumed in pass n+1 won't run into nodes handled in pass n and will see 0 or more nodes produced in pass n. This does include dependencies, but I think that's unavoidable. It's like a state machine: at state S, you run pass S, and that results in state S+1, where you run pass S+1. The state machine is pretty linear, as there aren't any events occuring inside the pass routines which could make the state machine go into another branch, but theoretically it could. For example, you could have a handler which simply checks the tree and if it finds a given situation, it might decide it's OK to run pass P first, and then pass S+1. Due to the nature of the small steps taken, it's flexible in doing that.
There are a couple of things I don't know how to handle them yet (I haven't looked at all extension methods yet like GroupJoin etc.), which are the Convert and the Quote expressions. For now I strip them out when necessary (and if it's even possible, sometimes you need to leave them in the tree to keep type equality), though I couldn't find a lot of information what these fellows do (Convert seem to be there to overcome the type problems in a BinaryExpression between a nullable type and a value typed constant) or when they've to stay. But time will tell, I guess.