Question

Yesterday I encountered a very strange error, and after a day I have barely made any progress, so I guess it's a good candidate for asking the community. I will ask for some patience because I think it's a tough one.

I have a C# WinForms app which hangs after a few clicks in production. It never happens in the development environment, only in production. When the hang occurs there are no error messages (although Task Manager shows the task as "Not responding"), but the GUI becomes unresponsive. I tried it on the same environment and I can confirm the behavior.

Unfortunately it is not possible to install the development tools and debug the application in the production environment. The best I could do was to take memory dumps of the application when it stopped. The problem is that I don't understand what I see in the dump at all: my main thread (the GUI thread) seems to be stuck in an instruction for which I cannot find any reason.

Here is the stack trace of my main thread:

KERNELBASE.dll!_RaiseException@16()  + 0x54 bytes    
[External Code]    
CFAPControlLibrary.dll!CFAPControlLibrary.Communication.Base.GetSetting(string settingName) Line 850 + 0x10 bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.ConfigHelper.Get<CFAPControlLibrary.DataTypes.ActionSortingOption>(string settingName) Line 25 + 0x35 bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.ConfigHelper.Get<CFAPControlLibrary.DataTypes.ActionSortingOption>(string settingName, CFAPControlLibrary.DataTypes.ActionSortingOption defaultVal) Line 15 + 0x9 bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.DataTypes.ActionStorage.Sort(System.Collections.Generic.List<CFAPControlLibrary.DataTypes.ActionClass> subject) Line 167 + 0xe bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.DataTypes.ActionStorage.GetByStatus(string pStatus) Line 162 + 0x46 bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.ActionSelector.FillNodes() Line 48 + 0x26 bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.CFAPMain.OnActionDetailsArrived(CFAPControlLibrary.CFAPMain.RawActionDetails bwr) Line 371 + 0x10 bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.CFAPMain.OnGetDetailsCompleted(object sender, System.ComponentModel.RunWorkerCompletedEventArgs e) Line 337 + 0xb bytes    C#
user32.dll!_InternalCallWinProc@20()  + 0x23 bytes    
user32.dll!_UserCallWinProcCheckWow@32()  + 0xb3 bytes    
user32.dll!_DispatchMessageWorker@8()  + 0xe6 bytes    
user32.dll!_DispatchMessageW@4()  + 0xf bytes    
[External Code]    
CFAPHost.exe!CFAPHost.Program.Main(string[] args) Line 50 + 0x1d bytes    C#
[External Code]    
mscoreei.dll!__CorExeMain@0()  + 0x38 bytes    
mscoree.dll!_ShellShim__CorExeMain@0()  + 0x227 bytes    
mscoree.dll!__CorExeMain_Exported@0()  + 0x8 bytes    
kernel32.dll!@BaseThreadInitThunk@12()  + 0x12 bytes    
ntdll.dll!___RtlUserThreadStart@8()  + 0x27 bytes    
ntdll.dll!__RtlUserThreadStart@8()  + 0x1b bytes

And here is my source code from the top stack frames. The disassembly from KernelBase.dll: [image: Frame from KernelBase.dll]

Then the last frame from my code; m_SettingCache is a Dictionary and it does not contain the requested key: [image: Base.GetSetting]

The next couple of frames: [three images, each captioned "Frame from KernelBase.dll"]

I think the code is pretty straightforward; it's just generic setting reading with a default value. If something goes wrong (the setting name is undefined or the conversion is not possible) the default value is returned. The code surely works. What I see from the dump is that the read from the dictionary never returns: it should throw a KeyNotFoundException, but the exception never reaches my code. Any suggestions?

Note: the main thread is indeed stopped in the state captured by the dump: every time I make a dump the result is the same.

Note2: the hang never happens on the first execution of this code path, in every scenario this very same code path was executed before the hang (deduced from the app log)

I will provide more details on request. Thanks in advance.

Edit:

CFAPControlLibrary.dll is the main assembly of the application. It contains the Windows Forms and their corresponding logic. Communication with the server is achieved with WCF, and the bigger requests are made on a parallel thread using a BackgroundWorker. The execution path you see in the call stack is invoked by the completion event of such a BackgroundWorker.

I pasted the requested code bits here

My AppDomain.CurrentDomain.UnhandledException handler is here

The part of the stack which I first considered irrelevant but which later proved to be important (sensitive string literals are deleted from the image):

[image: Evidence for Application.Run] This shows that Application.Run was called; I have no idea why it is not shown in the call stack.

Update

After spending three days without finding the cause of the problem, I decided to try a workaround. The memory dumps showed that the application always hangs at the very same point: where a KeyNotFoundException should have been thrown. The most straightforward workaround was to refactor that code so that it does not throw if possible. That version passed the tests and never hung. This is not a solution at all, but we couldn't spend any more time on this. So basically I cross my fingers, ship the code, and hope I never see this crash again.
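A minimal sketch of that kind of refactoring, not the actual application code. It assumes m_SettingCache is a Dictionary<string, string> and that the requested setting type is an enum; TryGetSetting, SetSetting and GetEnum are hypothetical names used only for illustration. The point is that Dictionary.TryGetValue reports a missing key through its return value, so no KeyNotFoundException is raised on the lookup path:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the settings code; only m_SettingCache is a name
// taken from the post, and its value type (string) is an assumption.
public static class SettingLookupSketch
{
    private static readonly Dictionary<string, string> m_SettingCache =
        new Dictionary<string, string>();

    // Illustrative helper for filling the cache.
    public static void SetSetting(string settingName, string value)
    {
        m_SettingCache[settingName] = value;
    }

    // Non-throwing lookup: returns false instead of raising
    // KeyNotFoundException when the key is absent.
    public static bool TryGetSetting(string settingName, out string value)
    {
        return m_SettingCache.TryGetValue(settingName, out value);
    }

    // Returns defaultVal when the setting is missing or cannot be parsed.
    // Assumes TEnum is an enum type; Enum.TryParse rejects other structs.
    public static TEnum GetEnum<TEnum>(string settingName, TEnum defaultVal)
        where TEnum : struct
    {
        string raw;
        if (!TryGetSetting(settingName, out raw))
        {
            return defaultVal;
        }

        TEnum parsed;
        return Enum.TryParse<TEnum>(raw, true, out parsed) ? parsed : defaultVal;
    }
}
```

With this shape the "setting not defined" case is an ordinary branch rather than an exception, which is what avoids the code path seen in the dumps. It does not explain why raising the exception hung in the first place.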

Thank you for all the suggestions


Solution

user32.dll!_DispatchMessageW@4()  + 0xf bytes    
[External Code]    
CFAPHost.exe!CFAPHost.Program.Main(string[] args) Line 50 + 0x1d bytes    C#

Rewrite. There is something seriously wrong with this part of the stack trace. The Main() method should always call Application.Run() to start pumping the message loop, or a ShowDialog() call should be present; those are the two normal ways in which messages can be dispatched. Neither is present, yet the DispatchMessage() winapi function is getting called anyway.

There is a very obscure other way in which messages can get pumped in the CLR. It happens when an application uses the lock statement on an [STAThread], like the main thread of a GUI app. Or WaitHandle.WaitOne() or Thread.Join(), the other common methods that block. Blocking an STA thread is illegal since it is so likely to cause deadlock, so the CLR pumps to avoid trouble. The code that does that would be hidden in the [External Code] section.

There's certainly evidence for that in the posted code, it uses lock in very inappropriate places. Using lock in UI code is never correct.

Seeing deadlock when the app crashes is then also easily explained.
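One way to keep lock out of UI code, sketched under assumptions since the posted code is not reproduced here: let the BackgroundWorker build a private result and hand it over through DoWorkEventArgs.Result, so the completion handler never touches state shared with the worker. DetailsLoader, FetchDetails and DetailsArrived are hypothetical names, not taken from the application:

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;

// Hypothetical sketch; none of these names come from the posted code.
public sealed class DetailsLoader
{
    private readonly BackgroundWorker worker = new BackgroundWorker();

    // Raised from RunWorkerCompleted. When RunWorkerAsync is called on a
    // WinForms UI thread, that handler is marshaled back to the UI thread.
    public event Action<IList<string>> DetailsArrived;

    public DetailsLoader()
    {
        worker.DoWork += (sender, e) =>
        {
            // Worker thread: produce a fresh object that nothing else
            // references yet, so no lock is needed to build it.
            e.Result = FetchDetails((string)e.Argument);
        };

        worker.RunWorkerCompleted += (sender, e) =>
        {
            // Reading e.Result rethrows a DoWork failure, so check Error first.
            if (e.Error != null)
            {
                return;
            }

            Action<IList<string>> handler = DetailsArrived;
            if (handler != null)
            {
                handler((IList<string>)e.Result);
            }
        };
    }

    public void Load(string status)
    {
        worker.RunWorkerAsync(status);
    }

    // Placeholder for the real server request.
    public static IList<string> FetchDetails(string status)
    {
        return new List<string> { status + "-1", status + "-2" };
    }
}
```

Because ownership of the result passes from the worker to the completion handler, the UI thread never has to block waiting for a lock the worker holds.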

This is a serious structural problem in the code, you'll need to fix it. Start from the Main() method, this goes wrong very early. Easy to check on your dev machine as well, just look at the call stack.
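If attaching a debugger is awkward, the same check can be done from code. The following is a small sketch using System.Diagnostics.StackTrace; StackProbe and IsOnStack are hypothetical names. It reports whether a given method appears anywhere on the current thread's managed call stack. Frames can be missing when the JIT inlines a method, so a negative answer in a release build is only a hint:

```csharp
using System;
using System.Diagnostics;
using System.Reflection;
using System.Runtime.CompilerServices;

// Hypothetical diagnostic helper, not part of the posted application.
public static class StackProbe
{
    // Returns true if a frame for declaringTypeFullName.methodName is on the
    // current thread's managed call stack.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static bool IsOnStack(string declaringTypeFullName, string methodName)
    {
        StackFrame[] frames = new StackTrace(false).GetFrames();
        if (frames == null)
        {
            return false;
        }

        foreach (StackFrame frame in frames)
        {
            MethodBase method = frame.GetMethod();
            if (method == null || method.DeclaringType == null)
            {
                continue;
            }

            if (method.DeclaringType.FullName == declaringTypeFullName &&
                method.Name == methodName)
            {
                return true;
            }
        }

        return false;
    }
}
```

For example, logging StackProbe.IsOnStack("System.Windows.Forms.Application", "Run") from inside the RunWorkerCompleted handler would show whether the message that triggered it was dispatched by the normal Application.Run loop.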

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow