Why change-tracking has to be part of an entity object
Recently, Andres Aguiar started a discussion with Udi Dahan about change tracking in the upcoming Entity Framework (EDM) from Microsoft. Basically, Andres described why it is unfortunate that the EDM doesn't have change tracking inside the entity objects themselves, and gave some examples. Udi pulled the discussion into SOA land, which I think was unfortunate as well, because there's a much more widely used example which illustrates why Andres is right and the rest is wrong: ASP.NET 2-way databinding. This article will be very technical, and it refers to stuff you won't run into most of the time, as it's functionality which sits deep inside controls you just use. However, to be able to use these controls and the features available to you in ASP.NET 2.0, they shouldn't force you to implement a lot of plumbing code yourself. The sole reason you're using these controls is that they are the plumbing: they should take care of all that, why else bother using them?
First, let me link to the explanation of how EDM does change tracking: David Simmons about change tracking in the EDM. That post already explains a bit about the true pain one has to go through with ASP.NET and EDM (there are other examples, but ASP.NET is an easy one and a lot of you can relate to it, so I'll use that). Now, what's the whole deal here? Why is this 'change tracking inside the entity' so important?
Imagine a simple webform: it has a gridview on it, and it's set up to edit Employee rows from Northwind, nothing fancy. As you're a good developer, you created a 'logical middle tier' which will provide the data for the UI and which will also handle modifications or other work the UI wants to get done. How this is done, via repositories in DDD style or via stateless manager classes, is not important. With 'logical middle tier', I mean: it's not necessarily a physical layer, it can be a part of another 'layer', and act as a middle-man between UI and DB infrastructure code. For simplicity, I'll refer to this layer as the Business Logic, or BL. One reason you've created the BL is that it separates how data is obtained from whatever persistent storage you're using from the place where it's actually used: the webform. For the Data Access Layer (DAL) / DB infrastructure code you'll take this fancy new EDM stuff Microsoft told you about.
Now, because you have better things to do with your time than writing a lot of plumbing code for ASP.NET, you want to use the ASP.NET 2.0 databinding feature: by placing a datasource control onto the page, you can connect your gridview control and your BL methods together. This means that when the page is rendered and the gridview asks its bound datasource control (the datasource controls all act as a slave to the control(s) bound to them) for data, the datasource control will call into the BL methods to obtain the data. For updates of data this works the same: the datasource control gets a call from the grid that some data has been changed and then sends the data to the method(s) defined to call for insert or update, which in this case are methods in the BL.
What happens inside these datasource controls? Well, it's pretty straightforward (writing a datasource control isn't straightforward however, I can assure you): when the gridview asks for its data from the datasource control it's bound to, it calls the ExecuteSelect method on the datasource control. The datasource control will then try to obtain the data for its master, in this case the gridview control. If this call happens for the first time, no data has been fetched yet by the datasource control, so it will obtain the data from the BL method you've set up to perform the fetch.
The BL method you're using is located in the EmployeeRepository class and simply fetches all Employees from the persistent storage using EDM: it returns a List<Employee> with all Northwind Employee entity objects. When you first run the page, the gridview calls the ExecuteSelect method, which in turn causes your BL method to be called, and the datasource control receives a nice collection of Employee objects, all filled with data. This list is returned as the return value of the ExecuteSelect method and the gridview control uses this list to display 9 rows of Employee data (or more/less, depending on your Northwind version/state. Let's keep it at 9 for now).
Under the hood, the datasource control also stored the 9 fetched objects somewhere, be it in the viewstate, session or ASP.NET cache. This is done to be able to realize two-way databinding. Storing the data is necessary because once a page is rendered, the page object itself and all of its state are gone. All that's left on the server is the session object (if any) of the user, the ASP.NET cache and the thread which handled the request, which is likely assigned to a new request by IIS. So when the post-back comes after some data is altered in the grid, the right objects have to be altered. How does this work?
When the user alters some data in the gridview and clicks the Save button on the row, the page will post back to the server. At that point, the gridview will collect what has changed and will call the ExecuteUpdate method on the datasource control just once, for all updated rows together. The ExecuteUpdate method gets little information about what has changed: the PK field values and the values of the field(s) which have changed. The datasource control pulls the stored data from the viewstate, session or ASP.NET cache (this depends on where you put it via config settings on the datasource control, and it doesn't really matter where it's stored) and updates the same objects which were received from your BL method when ExecuteSelect was called by the gridview.
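The round trip described so far can be sketched in a few lines. This is a rough, language-neutral model written in Python, not the ASP.NET datasource control API: the `SketchDataSource` class, its method names, and the `Employee` fields are all invented for illustration, and a plain dictionary stands in for the viewstate/session/cache storage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Employee:
    # Hypothetical, heavily trimmed Employee entity.
    employee_id: int
    last_name: str
    home_phone: str


class SketchDataSource:
    """Hypothetical stand-in for a datasource control; not the ASP.NET API."""

    def __init__(self,
                 select_method: Callable[[], List[Employee]],
                 update_method: Callable[[Employee], None]) -> None:
        self._select_method = select_method  # BL method which fetches
        self._update_method = update_method  # BL method which saves
        # Stands in for viewstate, session or ASP.NET cache.
        self._stored: Dict[int, Employee] = {}

    def execute_select(self) -> List[Employee]:
        # First call: nothing stored yet, so ask the BL for the data.
        if not self._stored:
            for entity in self._select_method():
                self._stored[entity.employee_id] = entity
        return list(self._stored.values())

    def execute_update(self, keys: Dict[str, int],
                       new_values: Dict[str, object]) -> int:
        # All the control gets: PK values plus the changed field values.
        entity = self._stored[keys["employee_id"]]
        for name, value in new_values.items():
            setattr(entity, name, value)
        # Only the touched entity is handed to the BL update method.
        self._update_method(entity)
        return 1
```

Note what the BL update method receives in this sketch: just the entity, with its new values already applied. The knowledge of which fields were touched lives only inside `execute_update`.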
The gridview control shipped with .NET 2.0 updates 1 row per post-back. But some grids don't: they'll update multiple rows, for example when they allow editing on the client and do one big postback, so the datasource control can in theory receive multiple updates in a single ExecuteUpdate call. Let's say the user updated 4 fields in a single Employee entity object and clicked Save. A single entity object in the set of 9 entity objects is now updated and has 4 changed fields. It also means that 8 entities are left untouched.
Which object knows which entity objects are new, which are existing but unchanged, and which are changed? Only the datasource control, based on the values passed into the ExecuteUpdate method. Now, let's assume the datasource control developer is smart and wants to take advantage of this information, so the datasource control only calls the update method specified on it (which is a method in your BL) for the entity objects it changed in the ExecuteUpdate method. So your BL method gets a single call with a single entity object passed in, namely the entity the user apparently altered. But would you write your BL method in such a way that it receives all the names of the changed fields and the original values as well? No, of course not. You would write the BL method in such a way that it receives an Employee entity, and the infrastructure used inside that BL method has to figure out the rest. Otherwise, using the method from other code would be very cumbersome: you would always have to specify the changed fields and original values, which is something you as an application developer shouldn't be worrying about, as it's plumbing code after all. Your API would also be very rigid: change one field in the Employee entity and the method is useless.
Your BL method is now faced with a problem, though: it only has an entity object with the new values, but which fields changed and which didn't? Did the photo field change, so a large blob has to be sent to the persistent storage? Or just the phone number? You don't know, as you don't have the original values. Is that int-typed field with value 0 really NULL or just 0? You don't know. So all that the DAL / DB infrastructure code can do is generate a SQL query which updates all fields in the table(s) the Employee entity is mapped on. This includes updating the photo field, which could be large and thus could be a performance bottleneck. But what's worse: no concurrency checks can be made, as you don't know the original values, nor do you know which fields were changed. Also, if a trigger is defined on the table(s) which does something when a given field is updated, it could be run without the field really being changed (as it's updated with the same value). This could lead to big problems.
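To make the difference concrete, here is a small sketch in Python of the two kinds of UPDATE statement a DAL could generate. Both function names and the `?` parameter style are assumptions made for this illustration, and this is not how EDM or any particular O/R mapper builds its SQL. The first function only has the current values to work with; the second also receives the original values of the changed fields, so it can update just those fields and add them to the WHERE clause as an optimistic concurrency check.

```python
from typing import Dict, List, Tuple


def build_full_update(table: str, key_column: str,
                      current: Dict[str, object]) -> Tuple[str, List[object]]:
    """No change information: every non-key column has to be written."""
    columns = [c for c in current if c != key_column]
    set_clause = ", ".join(f"{c} = ?" for c in columns)
    sql = f"UPDATE {table} SET {set_clause} WHERE {key_column} = ?"
    return sql, [current[c] for c in columns] + [current[key_column]]


def build_tracked_update(table: str, key_column: str,
                         current: Dict[str, object],
                         originals: Dict[str, object]) -> Tuple[str, List[object]]:
    """Change information available: `originals` maps each changed
    column to the value it had when the entity was fetched."""
    set_clause = ", ".join(f"{c} = ?" for c in originals)
    params: List[object] = [current[c] for c in originals]
    where = [f"{key_column} = ?"]
    params.append(current[key_column])
    # Concurrency check: the row must still hold the values the user saw.
    for column, old_value in originals.items():
        if old_value is None:
            where.append(f"{column} IS NULL")
        else:
            where.append(f"{column} = ?")
            params.append(old_value)
    sql = f"UPDATE {table} SET {set_clause} WHERE {' AND '.join(where)}"
    return sql, params
```

For an Employee whose phone number changed, the first function writes every column including the photo blob, while the second touches only the phone number and fails to match a row if someone else changed that number in the meantime.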
The 'solution' Microsoft provides for this, which is also a problem of Linq to Sql btw, is the following: you first have to attach an entity object with the original values to the context/session object, then pass in the entity object with the new values, and then it can perform change tracking. Yes, of course, but where do you get that entity object with the original values from? From the DB? Oh no sir, you can't do that, as the entity might have been changed in between, and if the user had seen THOSE values, the user might not have altered the entity at all. You also don't get the entity from the datasource control, so you have to store it yourself, in an in-memory cache. But what happens when you're on a webfarm with load-balancing?
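The shape of that pattern, and the burden it puts on the caller, can be modelled roughly as follows. This is a hypothetical Python sketch: `SketchContext`, `attach` and `changes_for` are names invented here and are not the EDM or Linq to Sql API. The point is only that the context can do nothing until somebody hands it the original values.

```python
from typing import Dict


class SketchContext:
    """Hypothetical context which tracks changes outside the entity."""

    def __init__(self) -> None:
        # Original field values per primary key, supplied by the caller.
        self._originals: Dict[int, Dict[str, object]] = {}

    def attach(self, key: int, original: Dict[str, object]) -> None:
        # The caller has to find the original values somewhere.
        self._originals[key] = dict(original)

    def changes_for(self, key: int,
                    current: Dict[str, object]) -> Dict[str, object]:
        if key not in self._originals:
            # A fresh context knows nothing about a disconnected entity.
            raise LookupError(f"no original values attached for key {key}")
        original = self._originals[key]
        return {name: value for name, value in current.items()
                if name not in original or original[name] != value}
```

Every post-back starts with a fresh context in this model, so the `attach` call, and the storage the original values have to come from, end up in application code.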
This starts to get really complicated, and you would easily forget the most important thing: why is it YOUR problem? Isn't this part of what we could call 'entity management', which should be taken care of by the framework you're using? Wouldn't this be solved properly if the entity object itself was smart enough to have change tracking on board? Yes it would. That's why Andres is so right with his remark and the rest isn't. Andres understands correctly that the end user of the framework doesn't care, and shouldn't have to care, about change tracking and when to pull old values from some cache just to make something simple work.
That's also why LLBLGen Pro, the O/R mapper framework I'm the lead developer of, has had change tracking built into the entity objects since day 1. Our users don't have to worry about change tracking issues at all: they just have an entity object, and what's changed is located inside it, including its original value (if a change was made). It's the business of the framework to take care of that, so the user of the framework (the developer) can focus on important things, like writing that application the developer has to finish. Pass the entity object over the wire to another server? No problem. Collect work inside a Unit of Work object which is then passed over the wire to a service or server? No problem, as everything is contained where it should be: inside the object which knows what's changed, as that object and only that object owns the data that has been changed.
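As an illustration of the general idea, and explicitly not of how LLBLGen Pro implements it, here is a minimal Python sketch of an entity which carries its own change tracking. The `TrackedEntity` class and all of its members are hypothetical, and JSON is used as a stand-in for whatever goes 'over the wire', which assumes the field values are JSON-friendly.

```python
import json
from typing import Dict, Optional, Tuple


class TrackedEntity:
    """Hypothetical entity which records its own changes."""

    def __init__(self, values: Dict[str, object],
                 originals: Optional[Dict[str, object]] = None,
                 is_new: bool = False) -> None:
        self._values = dict(values)
        # Original value per changed field; empty means nothing changed.
        self._originals = dict(originals or {})
        self.is_new = is_new

    def get(self, name: str) -> object:
        return self._values[name]

    def set(self, name: str, value: object) -> None:
        current = self._values[name]
        if current == value:
            return  # same value: not a change
        if name not in self._originals:
            self._originals[name] = current  # remember first original
        elif self._originals[name] == value:
            del self._originals[name]  # changed back to the original
        self._values[name] = value

    @property
    def is_dirty(self) -> bool:
        return self.is_new or bool(self._originals)

    def changed_fields(self) -> Dict[str, Tuple[object, object]]:
        # Field name -> (original value, current value).
        return {name: (old, self._values[name])
                for name, old in self._originals.items()}

    def to_wire(self) -> str:
        # The change information travels with the entity.
        return json.dumps({"values": self._values,
                           "originals": self._originals,
                           "is_new": self.is_new})

    @classmethod
    def from_wire(cls, payload: str) -> "TrackedEntity":
        data = json.loads(payload)
        return cls(data["values"], data["originals"], data["is_new"])
```

With something along these lines, a BL method which receives only the entity can still tell that a field went from NULL to 0, which fields to write, and which original values to use for a concurrency check, even after the entity has been serialized and rebuilt on another machine.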
One could argue all day why, with 'some extra work', the out-of-entity-object change tracking setup as seen in EDM and Linq to Sql is still 'usable', but that person would then overlook one basic core thing: the user of the framework shouldn't be the one who has to solve the plumbing problems of the framework. That's the job of the framework. Why else use that framework, if it only gives you extra work, which costs time and can create bugs which always pop up on Friday afternoon, etc.?
Don't think that this is solely related to ASP.NET. I used ASP.NET as an example to show the stupidity of this EDM/Linq to Sql design flaw, but any setup where an entity object gets disconnected from the session/context which fetched it will result in this 'DIY change tracking' code. And for what? Why isn't this solved properly? Beats me... you'd think that after all these years, Microsoft would come up with a framework for data access / entity management which would really help the developer by taking care of the plumbing crap and let the developer focus on what really matters: the application code. Oh well... maybe next time.