Question

This is the day of weird behavior.

We have a Win32 project made with Delphi 2007, which hosts the .NET runtime and calls into .NET to show new forms, as part of a transition period.

Recently we've begun experiencing exceptions at seemingly random locations and points of our code: Arithmetic overflow or underflow.

The stack trace of one of these looks like this:

at System.Windows.Forms.UnsafeNativeMethods.DispatchMessageW(MSG& msg)
at System.Windows.Forms.Application.ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(Int32 dwComponentID, Int32 reason, Int32 pvLoopData)
at System.Windows.Forms.Application.ThreadContext.RunMessageLoopInner(Int32 reason, ApplicationContext context)
at System.Windows.Forms.Application.ThreadContext.RunMessageLoop(Int32 reason, ApplicationContext context)
at System.Windows.Forms.Application.RunDialog(Form form)
at System.Windows.Forms.Form.ShowDialog(IWin32Window owner)
at System.Windows.Forms.Form.ShowDialog()
at Gatsoft.Gat.UI.Windows.Forms.Remanaging.RemanageForm.DelphiOpenInNewMode(String employeeCode, String departmentCode, DateTime date) in C:\Dev\VS.NET\Gatsoft\Gatsoft.Gat.UI.Windows\Forms\Remanaging\RemanageForm.Delphi.cs:line 67

In the Visual Studio solution, one of the outmost class libraries (ie. pulls in all the references it can), has set a specific debug program, targetted for the Delphi project output. This allows us to debug .NET code from Visual Studio, even though the main bulk of the program is written in Delphi.

The problem only occurs when run from the debugger, not if we just run the exe file directly (either through explorer, shortcuts, or even Ctrl+F5 inside Visual Studio).

There's apparently no spyware on the machine (as hinted by this).

Any other things we can check?


Edit: It looks like the .NET debugger is enabling this SNaN flags, and the Delphi debugger does not. We'll have to investigate this further, but for now I'll accept @Lorenzo Boccaccia's answer.

Apparently Solved

Ok, it looks like we've finally nailed this problem. The problem started occuring without having the debugger attached as well, for our testers, so we had to prioritize the problem way up.

Finally we found one common issue with the machines that had the problem, they are Dell Lattitude D620 laptops with an NVIDIA Quadro NVS 110M, with an old driver from a system image used to provision the laptops, from back in 2006.

I found one post on the web, though I lost the url when I rebooted to update the display driver, that had a .NET service crashing, mostly when the machine was busy doing something on the screen. One way to reproduce his problem was to open a command prompt to C:\ and doing a DIR /S to just force a massive amount of screen updates, which would trigger the crash.

He too had a NVIDIA video card.

The problem on my machine occured roughly every 2-4 startups of our program, but after updating the video driver I've had 123 successfull startups without any problems. (BTW I can recommend AutoHotKey for such things).

So it looks like we've found the culprit, an old/buggy NVIDIA driver.

Updated this question so that perhaps someone in the future can save some time.

Now, if you'll excuse me, I'm going to go cry in a corner.

Jinxed!

I must've jinxed it. No sooner had I posted the above update than a colleague laptop failed, after updating the video driver.

Still, I'm positive it's a problem outside of our application now, so it just remains to figure out which specific things to update.


Further updates: Ok, my machine is now apparently fixed, not so with my colleagues machine. So far we've updated the BIOS, Chipset drivers, and currently SP3 for XP is on its way in.

A burn-in test will be done tonight, where the app will be left overnight starting up, as the problem cropped up either during startup, or at the first time some WinForms .NET code was executed. This app is mainly a Delphi Win32 app, but it hosts the .NET runtime, and the problem seems to be related to .NET code. When we "boot" the .NET runtime, the problem can appear, or when we fire the first .NET window from Win32 then it can also appear.


Statistically I'm ready to release this code now. Over the night the application has been started 3051 times without errors, whereas before I updated the video driver it crashed every 2-4 times.

Prodded and found(!/?)

This bug-fixing ordeal feels like going to the doctor, where the following conversation ensues:

Doc: Does this hurt?
Me: No...
Doc: What about now?

I've prodded and poked the application and finally I think I've found something we did that introduced this problem.

In our app we host the .NET runtime, from a Delphi 2007 Win32 application, and in our glue-code we have the following line (now):

  rc := CorBindToRuntimeEx('v2.0.50727', 'wks',
  STARTUP_LOADER_OPTIMIZATION_MULTI_DOMAIN or STARTUP_CONCURRENT_GC,
  @clsid, @iid, UnkRuntimeEngine);

The two constants in the middle there was originally just a 0, meaning pick the defaults. This change was introduced a few months ago and the problem has been slowly creeping in on us after this. The change was introduced in order to encourage ANTS profiler to load our Win32 application + hosted .NET runtime in order to do performance profiling and the changes we introduced back then made that work. Additionally, the problem with arithmetic overflow/underflow has slowly been getting worse so I bet the problem didn't appear for a while after the change so it wasn't attributed to any of the changes we did.

Also, since we only (originally) saw the problem when running through the debugger, we thought something was wrong with Visual Studio and/or Delphi.

Anyway, statistically now, with a browser on one screen doing repeated scrolling up and down triggered by a javascript (apparently needed in order to trigger the bug), then I have been able to successfully start the application 726 times with a 0 in the call, and it crashes 5 out of 17 times with the two constants there.

Doc: Does this hurt?

And let's not get into who made that change in the first place. I'm sure the culprit wants to be left anonymous... cough

Was it helpful?

Solution

a debug version of a linked dll could be compiled with signaling nan support, see http://blogs.msdn.com/oldnewthing/archive/2008/07/02/8679191.aspx for an example of this problem.

that heisenbug was caused by uninitialized variables, here there could be a linked dll enabling the snan feature of the cpu and forgetting to disable it upon returning

OTHER TIPS

Do the errors occur still occur if you attach the debugger after starting the application?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top