The Achilles Thread
Recovered from DotNetJunkies blog -- Originally Posted: Sunday, November 18, 2007
"[when debugging] If you see hoof prints think horses, not zebras".
Hunt and Thomas, in The Pragmatic Programmer
Developing in the brave new world of managed memory and garbage collection we, in most cases, don't have to worry about the details of memory allocation and clean-up. However, when our application starts having memory issues, we can easily have a tendency to fall into one of two camps:
- Assuming all memory issues are memory leaks (i.e. objects not being disposed of properly)
- Assuming that garbage collection is not working as stated.
Both of these assumptions are flawed in one common way: they narrow our focus to a limited set of possibilities, and that affects the way we approach the problem. Sparing you the cutesy saying about what happens when we ass-u-me, it is fair to say that assumptions cripple any debugging effort. Once we assume, we lose control over the "scientific" process of debugging a problem. In this case we had a utility application consuming egregious amounts of memory, 1 to 1.5 GB, just to process a queue. I immediately assumed a traditional memory leak. As we were dealing with MSMQ, MessageEnumerators, Messages, SqlConnections, SqlCommands etc., I figured something was not getting disposed of or was still gcroot-ed somewhere in the code.
Problem scenario: In a console application we are iterating thru 85,000 messages in a "dead-letter" queue to see if we can salvage the data that never made it to its destination, the database. As we iterate thru the queue we pass off each message to a handler designed for that particular type of message using a factory. The handler de-serializes the message body back into the original object, generates a SQL insert based on the message contents and inserts a row in the database. As each handler returns a result, messages are removed from or left on the MessageQueue. Rinse, repeat.
If we remove the call to the datalayer component that writes the new rows in the DB, the memory issue goes away. So clearly we have some SqlProvider objects not getting disposed of, right? Wrong. In surgically adding back the datalayer functionality one method at a time, it becomes clear that just opening a connection to the database once is all it takes to create the issue. Comment out the line that creates the connection: no memory issue. Add it back in and the death spiral returns.
To see if I can reproduce that scenario with fresh code, I rebuild the functionality from the ground up in a Windows Forms application to make debugging more friendly. I set up a few different scenarios and, just to make the experience more lively, I add some counters to the forms so I can see how many messages have been processed etc. I take all the same steps as described above and... no memory issue. I start reviewing the code method by method to see what I am doing differently. I find nothing; everything is exactly the same... except... in the Windows form I am publishing messages to the GUI using a quick and dirty "DoEvents" call in the loop (who wants to stare at a frozen form?). Then it hits me: I remove the DoEvents and the memory issue is back. Now that I have not an assumption but the real problem description, it doesn't take long to find the answer, as the links below will point you to.
The short answer proved to be this: in VB.NET console apps the Main() method is launched by default on a Single Threaded Apartment (STA) thread. The same is true of a Windows Forms app and its GUI thread. The STA model keeps that thread safe from multi-threading issues, especially when using COM objects that are acting like they are all alone in the world. So, when a COM component is created, used and released on an STA thread, it needs to be finalized, and the finalizer thread has to call back into that STA thread to release it. The finalizer thread therefore has to wait until it is "invited to the party" in order to finalize that object. If the STA thread is in a long running loop and is not pumping messages (à la DoEvents, Thread.Join() etc.), then the finalizer is blocked until the loop is done and the STA thread takes a breath. This explains how one database connection could kill this app. The connection was ready to be finalized, and the finalizer queued up to process it but was blocked waiting for the STA thread to take a breath. Meanwhile the STA thread is cranking thru a huge queue and leaving an Exxon Valdez size slick of managed objects that are waiting on guess who: the finalizer thread.
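As a rough illustration of what "taking a breath" could look like when the main thread has to stay STA, here is a minimal sketch that makes a short Thread.Join() call on the current thread every so often inside a long loop. It is not the code from the original utility: the module name, ProcessOneMessage, the interval of 1000 iterations and the 10 ms timeout are all assumptions made up for this example, and whether an occasional Join is enough will depend on the application.

```vb
Imports System
Imports System.Threading

Module PumpingSketch
    ' The main thread is explicitly STA here, to mirror the problem scenario.
    <STAThread()> _
    Sub Main()
        For i As Integer = 1 To 85000
            ProcessOneMessage(i)

            ' Every 1000 iterations (an arbitrary interval), make a short
            ' managed blocking call. On an STA thread a managed wait such as
            ' Thread.Join gives pending COM calls, including the finalizer
            ' thread's release calls, a chance to be serviced.
            If i Mod 1000 = 0 Then
                Thread.CurrentThread.Join(10) ' 10 ms timeout is an assumption
            End If
        Next
    End Sub

    ' Hypothetical stand-in for handing one queue message to its handler.
    Private Sub ProcessOneMessage(ByVal index As Integer)
    End Sub
End Module
```

Joining the current thread never succeeds, so the call simply waits out the timeout and returns; the only point of it is the opportunity it gives the thread to service waiting calls.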
The suggested solution: add an MTAThread() attribute to the entry point of your application (Sub Main()) to prompt the runtime to load your main thread in Multithreaded Apartment (MTA) mode. Now the finalizer can sneak in and do its clean-up even though the main loop is not making any effort to share.
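A minimal sketch of that change, assuming a plain console module: the module name and the ProcessDeadLetterQueue routine are hypothetical placeholders, not the original code. The GetApartmentState() call is only there so you can check which apartment the main thread actually ended up in.

```vb
Imports System
Imports System.Threading

Module Program
    ' MTAThread asks the runtime to initialize the main thread in a
    ' multithreaded apartment instead of a single-threaded one.
    <MTAThread()> _
    Sub Main()
        ' Print the apartment state to confirm the attribute took effect.
        Console.WriteLine(Thread.CurrentThread.GetApartmentState())

        ' Hypothetical stand-in for the long-running queue loop.
        ProcessDeadLetterQueue()
    End Sub

    Private Sub ProcessDeadLetterQueue()
        ' Placeholder: iterate the queue and pass each message to its handler.
    End Sub
End Module
```

Note that this only suits applications whose main thread does not need to be STA. A Windows Forms GUI thread, for example, is expected to stay STA, so there the loop has to keep yielding instead.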
So, it wasn't quite horses, but not full-bred zebras either. Either way it was a fun ride.
© Copyright 2009 - Andreas Zenker