Troubleshooting an Intermittent .NET High CPU Problem
We’d been getting sporadic reports of high CPU usage in Witty (a WPF Twitter client). I’d tried running the application in debug mode for a while and could never get it to occur, but finally I saw it happening while I was running a release build (keep in mind that 53% means one of my two CPU cores was saturated):
Great, let’s attach a debugger to see what’s going on:
Troubleshooting high CPU use often leads to chasing heisenbugs, since the CPU usage of course drops when you break into the code. But I could at least step through the current code a bit to get an idea of the current state. Doing that showed a lot of time at System.Threading.WaitHandle.WaitOne, which made me suspect a threading issue.
On the other hand, breaking into a multithreaded app could have caused that, too. We need more information – let’s take a look at the current threads:
Note to self: should have named the threads so I’d have a better idea of what they’re all doing. I’ll talk more about that in a little bit.
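For the impatient, the short version is that a thread’s Name property can only be assigned once, so the place to do it is right where the thread is created. A quick sketch (the thread name here is just a placeholder, not anything Witty actually uses):

using System.Threading;

class ThreadNamingSketch
{
    static void Main()
    {
        // Naming worker threads up front makes them easy to pick out in the
        // debugger's Threads window ("Feed refresh" is just a placeholder).
        Thread worker = new Thread(DoWork);
        worker.Name = "Feed refresh";
        worker.Start();
    }

    static void DoWork()
    {
        Thread.Sleep(1000);
    }
}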
Now, there are probably two ways to go from here:
- Smart, systematic debugging
- Brute force hacking
In this case, since we don’t have that many threads to deal with, I think some brute force interrogation is a good way to get started. If we don’t find anything, we’ll cool off and do the whole Sherlock Holmes contemplation thing, complete with pipe and violin.
First Guess: The Red Herring
This is not the real problem, it’s a wrong guess. I’m including this for two reasons:
- It’s the truth. It’s what I really did. I’m not going to edit out my stupidity… this time.
- It shows a little interesting information on loading symbols from the Microsoft symbol servers to step into and debug .NET framework code
That said, if you want to skip over this bit, I won’t be offended in the least.
So at this point, I started just looking through the thread list. When we activate a thread, even in disassembly, we can get a pretty good idea of what it’s doing. There’s a current call stack, and the disassembly view shows us the current executing method:
Well, our current hunch is that the System.Threading.WaitHandle.WaitOne call has to do with the problem, and it’s showing up as the current executing method, so it’s worth looking at for a bit. But that disassembly is over my head. We need to get the actual source… fortunately, that’s not tough. We can load the source from the Microsoft Symbol Servers:
And a few seconds later, we’re interactively debugging framework code. Paging Dr. Awesome…
Note: If you’re not able to see the framework code in the call stack, you’ll need to turn off the “Just My Code” debugger option:
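(If your copy of Visual Studio is laid out like mine, that checkbox lives under Tools > Options > Debugging > General – “Enable Just My Code” – and the symbol server settings are right next door under Debugging > Symbols.)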
I’ll spare you the embarrassment of watching me wander around the WPF code which pulls down databound images from internet addresses. It’s interesting code, but after poking around for a bit I convinced myself that this wasn’t the source of the high CPU issues.
Wising Up: Freezing Threads To Find The Culprit
At this point, I realized that I needed to get a little smarter. Instead of browsing through all these threads hoping to stumble across the problem, I needed a way to determine which one was actually the culprit. Well, how about this?
- Break into the debugger
- Freeze one of the threads
- Resume the app
- See if the CPU usage has dropped – if so, we’ve found the bad thread, so we can debug it
- If not, repeat
Simple enough. You can right click any of the threads in the list and freeze it:
Now, this is still kind of voodoo troubleshooting at this point, since there’s a possibility that the problem’s caused by the interaction of two different threads, or something else that’s going to be masked by a one-at-a-time approach. I’m playing another hunch here – I know enough about this application to assume that most of the threads are independent.
Intermission: Play Along At Home with this Simple Runaway Thread Demo
This may be a little tough to follow since there are a lot of moving parts. Let’s step back and look at a very simple application with one runaway thread. Create a new console application and add the following code:
using System.Threading;

namespace RunawayThreadSample
{
    class Program
    {
        static void Main(string[] args)
        {
            for (int niceThreads = 1; niceThreads <= 10; niceThreads++)
            {
                new Thread(() =>
                {
                    while (true)
                        Thread.Sleep(500);
                }) { Name = "Thread " + niceThreads }.Start();
            }

            new Thread(() =>
            {
                while (true) { }
            }) { Name = "Evil Thread" }.Start();
        }
    }
}
A few things to notice here:
- We’re naming our threads this time. With the object initializer syntax that came with .NET 3.5 (C# 3.0), we can set the name right when the thread is created.
- We’re using lambda syntax to define the anonymous thread delegates – that’s the “() => {…}” stuff. Karl Seguin covered that pretty well in his Back To Basics series.
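If the lambda-and-initializer combination looks unfamiliar, here’s roughly the same thing spelled out the long way – a sketch of the equivalent, not code from the sample above:

// The pre-C# 3.0 way to write the "Evil Thread" line: an anonymous
// delegate instead of a lambda, and a separate property assignment
// instead of an object initializer.
Thread evilThread = new Thread(delegate() { while (true) { } });
evilThread.Name = "Evil Thread";
evilThread.Start();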
Start up Task Manager (or run ProcMon), then run the application. You should see your CPU use pegged at around 100%, as the “Evil Thread” spins in a tight loop. Now break into the debugger – you should see your CPU usage drop as all the threads are suspended. Freeze the Evil Thread (right-click, freeze):
Then resume the application (by hitting F5) and you should see that your CPU usage stays low.
Returning To Our Regularly Scheduled Program…
Sure enough, freezing and thawing threads in turn points me to a single thread which is causing the high CPU usage – one that’s spending a lot of time in Witty.SingleInstanceManager.WaitForSignal(). Interesting… a little searching lets me know that this class came from a CodeProject article about how to enforce a single instance for an application. So that explains why this bug was hard to reproduce – it only happened when the application was already running and you tried to start a second instance. That’s not something you’d normally do on purpose, so the bug just appeared to happen randomly.
But why didn’t we ever see it in debug mode? Let’s look at the code that sets up the single instance manager:
/// <summary>
/// Enforce single instance for release mode
/// </summary>
private void SetupSingleInstance()
{
#if !DEBUG
    Application.Current.Exit += new ExitEventHandler(Current_Exit);
    _instanceManager = new SingleInstanceManager(this, ShowApplication);
#endif
}
And there we go – the single instance manager only runs in release mode. Heisenbug magnet! Removing the #if preprocessor directive lets us debug the single instance manager, so now it’s time to take a look at that WaitForSignal() method:
// thread method will wait on the event, which will signal
// if another instance tries to start
private void WaitForSignal()
{
    while (true)
    {
        uint result = WaitForSingleObject(m_EventHandle, INFINITE);
        if (result == 0)
        {
            m_Instance.Dispatcher.Invoke(System.Windows.Threading.DispatcherPriority.Normal, m_callback);
        }
        else
        {
            return;
        }
    }
}
There’s the bug - if you know the answer, just keep quiet so you don’t spoil it for the rest of the class.
Let’s take a look at how that WaitForSingleObject thing is working. First, when the application starts, we try to open the named event with OpenEvent; if it doesn’t exist yet (we’re the first instance), we create it with CreateEvent and spin off a thread that calls WaitForSignal (code above):
public SingleInstanceManager(Window instance, ShowApplicationCallback callback)
{
    m_Instance = instance;
    m_callback = callback;

    // try to open our event
    m_EventHandle = OpenEvent(EVENT_MODIFY_STATE | SYNCHRONIZE, false, EVENT_NAME);
    if (m_EventHandle == IntPtr.Zero) // if it doesn't exist
    {
        // create our event
        m_EventHandle = CreateEvent(IntPtr.Zero, true, false, EVENT_NAME);
        if (m_EventHandle != IntPtr.Zero) // if successful
        {
            Thread thread = new Thread(new ThreadStart(WaitForSignal));
            thread.Start();
        }
    }
    else
    {
        SetEvent(m_EventHandle);
        MessageBox.Show("There is already an instance of Witty");
        Environment.Exit(0);
    }
}
So, here’s how it’s working – the first instance ends up holding a handle to the named event, and the WaitForSignal method calls WaitForSingleObject, which returns 0 (WAIT_OBJECT_0) when the event is signaled – that is, when a second instance finds the event already exists and calls SetEvent on it. Note – it looks like we’re running a tight loop with while(true), but since WaitForSingleObject is called with an INFINITE timeout, it won’t return until the event is signaled, meaning the loop never even completes a single pass unless a second instance is started.
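For context, here’s roughly what the P/Invoke plumbing behind those calls looks like – a sketch of the standard kernel32 declarations, not necessarily character-for-character what the CodeProject class uses (EVENT_NAME is whatever unique string the class picks, so it’s left out here):

using System;
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    internal const uint EVENT_MODIFY_STATE = 0x0002;
    internal const uint SYNCHRONIZE = 0x00100000;
    internal const uint INFINITE = 0xFFFFFFFF;

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
    internal static extern IntPtr OpenEvent(uint dwDesiredAccess, bool bInheritHandle, string lpName);

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
    internal static extern IntPtr CreateEvent(IntPtr lpEventAttributes, bool bManualReset, bool bInitialState, string lpName);

    [DllImport("kernel32.dll", SetLastError = true)]
    internal static extern bool SetEvent(IntPtr hEvent);

    [DllImport("kernel32.dll", SetLastError = true)]
    internal static extern bool ResetEvent(IntPtr hEvent);

    [DllImport("kernel32.dll", SetLastError = true)]
    internal static extern uint WaitForSingleObject(IntPtr hHandle, uint dwMilliseconds);
}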
But what happens when a second instance is started? Well, WaitForSingleObject returns 0, and we invoke the callback method, which just shows the application. No problem…
Except that now we’re in a tight loop. We loop back around, call WaitForSingleObject again, and it immediately returns 0, because the event is still signaled – it was created as a manual-reset event (the second parameter to CreateEvent is true), so it stays signaled until something explicitly resets it. There’s our high CPU bug. The solution is to call ResetEvent once we’ve shown the application, so that WaitForSingleObject blocks again the next time through the loop:
// thread method will wait on the event, which will signal
// if another instance tries to start
private void WaitForSignal()
{
    while (true)
    {
        uint result = WaitForSingleObject(m_EventHandle, INFINITE);
        if (result == 0)
        {
            m_Instance.Dispatcher.Invoke(System.Windows.Threading.DispatcherPriority.Normal, m_callback);
            ResetEvent(m_EventHandle);
        }
        else
        {
            return;
        }
    }
}
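As an aside, another way to get the same behavior would have been to create the event as auto-reset in the first place – pass false for CreateEvent’s bManualReset parameter and Windows resets the event automatically once a wait is satisfied. A one-line sketch of that alternative (not what the fix above does):

// Auto-reset event: the second argument (bManualReset) is false, so a
// successful WaitForSingleObject resets it automatically - no explicit
// ResetEvent call needed.
m_EventHandle = CreateEvent(IntPtr.Zero, false, false, EVENT_NAME);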
Now that we’re calling ResetEvent, we stay out of that tight invocation loop and we’ve licked this CPU bug. To be honest, unnecessary loops are a common source of bugs, and this one still kind of freaks me out even with an infinite timeout on WaitForSingleObject. Given time I’d like to look at some other single-instance solutions, but there’s no real performance driver for that.
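For what it’s worth, the managed EventWaitHandle class (available since .NET 2.0) can do the same named-event dance without any P/Invoke, which would sidestep most of this. A rough sketch of that shape – a hypothetical replacement, not Witty’s actual code, and the event name is just a placeholder:

using System;
using System.Threading;

// Hypothetical managed sketch of a single-instance check.
class SingleInstanceSketch
{
    static void Main()
    {
        bool createdNew;
        EventWaitHandle signal = new EventWaitHandle(
            false, EventResetMode.AutoReset, @"Local\MyAppSingleInstance", out createdNew);

        if (!createdNew)
        {
            // Another instance already owns the named event: poke it and exit.
            signal.Set();
            return;
        }

        // First instance: watch the event on a background thread and bring the
        // existing window to the front whenever a new instance signals us.
        Thread watcher = new Thread(delegate()
        {
            while (true)
            {
                signal.WaitOne(); // blocks until signaled; auto-reset, so no tight loop
                Console.WriteLine("Second instance detected - show the main window here.");
            }
        });
        watcher.IsBackground = true;
        watcher.Name = "Single-instance watcher";
        watcher.Start();

        Console.WriteLine("Running as the single instance. Press Enter to exit.");
        Console.ReadLine();
    }
}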
Just Hacking Here
This has pretty much been “Git ‘R Done” debugging. It’s hacking. Hopefully it’s helpful to you, but I know that there are folks out there with some real skill at diagnosing application performance issues, and there are better debugging tools available, too. How would you go about diagnosing something like this?