Question

Recently, a 64-bit .NET 4.0 process of ours that includes some unmanaged code has been crashing by simply disappearing. There are no Event Viewer entries and no Windows error dialogs, and our current logging and trace statements don't indicate anything obvious. The code base is very large, so adding more trace statements would be time-consuming.

We have several third-party DLLs in use, but we have access to all the PDB files we need. The crash happens frequently throughout the day, but not at regular intervals. Our group suspects some mishandled multicast traffic might be the cause, but we're not 100% sure.

We've used ADPlus to debug the process in crash mode:

adplus -crash -p <pid> -o c:\temp

and we've been getting some very strange behavior: the last minidump written when the crash occurs is for a first-chance "CONTRL_C_OR_Debug_Break exception", yet we are most certainly not pressing Ctrl+C. Every time we've attached the debugger, we've gotten this minidump anywhere from 10 minutes to 2 hours after launch. There are no second-chance exceptions and no out-of-control memory or CPU usage.

I am admittedly a novice when it comes to CDB/ADPlus/WinDbg, but I know at least a few WinDbg/SOS commands for finding my way around a crash dump; on this minidump, I am stumped.

Am I going about diagnosing this problem the right way? What else can we do?

UPDATE

After getting the correct Windows Server 2008 symbol files, this appears to be the stack. What's the best way to hunt down possible heap corruption?

0:039> k
  *** Stack trace for last set context - .thread/.cxr resets it
Child-SP          RetAddr           Call Site
00000000`2d06f4f0 00000000`77834736 ntdll!RtlReportCriticalFailure+0x2f
00000000`2d06f5c0 00000000`77835942 ntdll!RtlpReportHeapFailure+0x26
00000000`2d06f5f0 00000000`778375f4 ntdll!RtlpHeapHandleError+0x12
00000000`2d06f620 00000000`777ddc8f ntdll!RtlpLogHeapFailure+0xa4
00000000`2d06f650 00000000`7767307a ntdll! ?? ::FNODOBFM::`string'+0x10c54
00000000`2d06f6d0 00000000`72a88cc4 kernel32!HeapFree+0xa
00000000`2d06f700 00000000`6ea37ffb msvcr100!free+0x1c
00000000`2d06f730 00000000`eb692d6c jvm+0x187ffb
00000000`2d06f738 00000000`2d06f7a8 0xeb692d6c
00000000`2d06f740 00000000`00000000 0x2d06f7a8
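
One way to read a trace like this is to walk the frames from the top and find the first module that is not part of the OS or the C runtime, since that is the code that called free. The sketch below does this for the `k` output pasted above. The set of "system" modules and the helper names are assumptions made for illustration. Note also that heap corruption is often detected long after the bad write, so the module found this way is only a starting point, not proof of the culprit.

```python
import re
from typing import List, Optional

# Modules treated as OS/runtime code for this illustration (an assumption).
SYSTEM_MODULES = {"ntdll", "kernel32", "msvcr100"}

# A 64-bit `k` frame line: Child-SP, RetAddr, then the call site.
FRAME_RE = re.compile(
    r"^[0-9a-f]{8}`[0-9a-f]{8}\s+[0-9a-f]{8}`[0-9a-f]{8}\s+(.+)$",
    re.IGNORECASE,
)
# A call site starts with a module name followed by "!" or "+0x".
MODULE_RE = re.compile(r"([A-Za-z_][\w.]*)(?:!|\+0x)")

STACK = """\
Child-SP          RetAddr           Call Site
00000000`2d06f4f0 00000000`77834736 ntdll!RtlReportCriticalFailure+0x2f
00000000`2d06f5c0 00000000`77835942 ntdll!RtlpReportHeapFailure+0x26
00000000`2d06f5f0 00000000`778375f4 ntdll!RtlpHeapHandleError+0x12
00000000`2d06f620 00000000`777ddc8f ntdll!RtlpLogHeapFailure+0xa4
00000000`2d06f650 00000000`7767307a ntdll! ?? ::FNODOBFM::`string'+0x10c54
00000000`2d06f6d0 00000000`72a88cc4 kernel32!HeapFree+0xa
00000000`2d06f700 00000000`6ea37ffb msvcr100!free+0x1c
00000000`2d06f730 00000000`eb692d6c jvm+0x187ffb
00000000`2d06f738 00000000`2d06f7a8 0xeb692d6c
00000000`2d06f740 00000000`00000000 0x2d06f7a8
"""


def frame_modules(stack_text: str) -> List[str]:
    """Return the module name of every frame, or '<unknown>' for raw addresses."""
    modules = []
    for line in stack_text.splitlines():
        frame = FRAME_RE.match(line.strip())
        if not frame:
            continue  # header or other non-frame line
        module = MODULE_RE.match(frame.group(1))
        modules.append(module.group(1) if module else "<unknown>")
    return modules


def first_non_system_module(stack_text: str) -> Optional[str]:
    """Return the first named module outside SYSTEM_MODULES, top frame first."""
    for module in frame_modules(stack_text):
        if module != "<unknown>" and module not in SYSTEM_MODULES:
            return module
    return None


if __name__ == "__main__":
    print(first_non_system_module(STACK))
```

For the trace above, the frames below `jvm` are raw addresses with no symbols, so the walk stops at `jvm` as the caller of `msvcr100!free`.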

UPDATE 2

It turns out that the combination of our app and a newer version of the JDK was indeed corrupting the heap. We caught the crash dump after enabling full page heap with gflags:

gflags -p /enable MyProcess.exe /full

We're still not sure exactly why, but downgrading our JVM fixed the problem for now. Big thanks to @MarcSherman and @SevaTitov for helping in the comments.


Solution

Here's what I did to find the root of the heap corruption:

  1. Installed Debugging Tools for Windows as a "Standalone" component.
  2. Enabled full heap verification with gflags:

    gflags -p /enable MyProcess.exe /full
    
  3. Caught the resulting crash dump with ADPlus:

    adplus.exe -crash -o <outputdirectory> -p <PID>
    
  4. Opened the resulting crash dump in WinDbg and ran:

    !analyze -v
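
If these steps need to be repeated across several test runs, the command lines can be assembled by a small helper. The following is a minimal sketch that only builds and prints the commands from steps 2 and 3 rather than running them. The function names, the PID `1234`, and the output directory are placeholders, and the `/disable` command is included on the assumption that page heap should be switched off again after the investigation. Page heap settings are read when the process starts, so the target would need to be restarted after step 2.

```python
from typing import List


def page_heap_enable_cmd(image_name: str) -> List[str]:
    """Build the gflags command that turns on full page heap for an image."""
    return ["gflags", "-p", "/enable", image_name, "/full"]


def page_heap_disable_cmd(image_name: str) -> List[str]:
    """Build the matching command to turn page heap off again afterwards."""
    return ["gflags", "-p", "/disable", image_name]


def adplus_crash_cmd(pid: int, output_dir: str) -> List[str]:
    """Build the ADPlus crash-mode command used in step 3."""
    return ["adplus.exe", "-crash", "-o", output_dir, "-p", str(pid)]


if __name__ == "__main__":
    # Print the commands instead of executing them; the image name, PID and
    # output directory are placeholders for illustration.
    for cmd in (
        page_heap_enable_cmd("MyProcess.exe"),
        adplus_crash_cmd(1234, r"c:\temp"),
        page_heap_disable_cmd("MyProcess.exe"),
    ):
        print(" ".join(cmd))
```

Returning argument lists rather than single strings keeps the sketch usable with `subprocess.run` on a machine where the Debugging Tools for Windows are on the PATH.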
    

Thanks to @MarcSherman and @SevaTitov for their help in the comments.

Licensed under: CC-BY-SA with attribution