live debugging a stack overflow

https://stackoverflow.com/questions/769619

12-09-2019
|

Question

I have a managed code Windows Service application that is crashing occasionally in production due to a managed StackOverFlowException. I know this because I've run adplus in crash mode and analyzed the crash dump post mortem using SoS. I have even attached the windbg debugger and set it to "go unhandled exception".

My problem is, I can't see any of the managed stacks or switch to any of the threads. They're all being torn down by the time the debugger breaks.

I'm not a Windbg expert, and, short of installing Visual Studio on the live system or using remote debugging and debugging using that tool, does anyone have any suggestions as to how I can get a stack trace out of the offending thread?

Here's what I'm doing.

!threads

...

XXXX 11 27c 000000001b2175f0 b220 Disabled 00000000072c9058:00000000072cad80 0000000019bdd3f0 0 Ukn System.StackOverflowException (0000000000c010d0)

...

And at this point you see the XXXX ID indicating the thread is quite dead.

Solution

Once you've hit a stack overflow, you're pretty much out of luck for debugging the problem - blowing your stack space leaves your program in a non-deterministic state, so you can't rely on any of the information in it at that point - any stack trace you try to get may be corrupted and can easily point you in the wrong direction. Ie, once the StackOverflowException occurs, it's too late.

Also, according to the documentation you can't catch a StackOverflowException from .Net 2.0 onwards, so the other suggestions to surround your code with a try/catch for that probably won't work. This makes perfect sense, given the side effects of a stack overflow (I'm surprised .Net ever allowed you to catch it).

Your only real option is to engage in the tedium of analyzing the code, looking for anything that could potentially cause a stack overflow, and putting in some sort of markers so you can get an idea where they occur before they occur. Eg, obviously any recursive methods are the first place to start, so give them a depth counter and throw your own exception if they get to some "unreasonable" value that you define, that way you can actually get a valid stack trace.

OTHER TIPS

Is it an option to wrap your code with a try-catch that writes to the EventLog (or file, or whatever) and run this debug one-off?

try { ... } catch(SOE) { EventLog.Write(...); throw; }

You won't be able to debug, but you would get the stack trace.

One option you have is to use a try/catch block at a high level, and then print or log the stack trace provided by the exception. Every exception has a StackTrace property that can tell you where it was thrown from. This won't let you do any interactive debugging, but it should give you a place to start.

For what its worth, starting in .NET 4.0, Visual Studio (and any debuggers that rely on the ICorDebug api) gain the ability to debug minidumps. This means you will be able to load the crash dump into the VS debugger on a different computer and see the managed stacks similar to if you had attached a debugger at the time of the crash. See the PDC talk or Rick Byers' blog for more information. Unfortunately this won't help you with the problem at hand, but perhaps it will next time you run into this issue.

Take a look at your ADPLUS Crash Mode Debug Log. See if there are any access violations or true native Stack Overflow Exceptions happening before the managed StackOverflowException is thrown.

My guess is that there is an exception on the thread's stack that you cold catch before the thread exits.

You could also use DebugDiag from www.iis.net and then set a Crash rule and create a full dump file for Access Violations (sxe av) and Stack Overflow native exceptions (sxe sov)

Thanks, Aaron

I have a RecursionChecker class for this sort of thing. I hereby disclaim copyright on the code below.

It complains if it finds itself hitting the check for the target object too often. It's not a be-all-end-all; loops can cause false positives, for example. One could avoid that by having another call after the risky code, telling the checker that it can decrement its recurse call for the target object. It still would not be bulletproof.

To use it, I just call

public void DangerousMethod() {
  RecursionChecker.Check(someTargetObjectThatWillBeTheSameIfWeReturnHereViaRecursion);
  // recursion-risky code here.
}

Here is the RecursionChecker class:

/// <summary>If you use this class frequently from multiple threads, expect a lot of blocking. In that case,
/// might want to make this a non-static class and have an instance per thread.</summary>
public static class RecursionChecker
{
  #if DEBUG
  private static HashSet<ReentrancyInfo> ReentrancyNotes = new HashSet<ReentrancyInfo>();
  private static object LockObject { get; set; } = new object();
  private static void CleanUp(HashSet<ReentrancyInfo> notes) {
    List<ReentrancyInfo> deadOrStale = notes.Where(info => info.IsDeadOrStale()).ToList();
    foreach (ReentrancyInfo killMe in deadOrStale) {
      notes.Remove(killMe);
    }
  }
  #endif
  public static void Check(object target, int maxOK = 10, int staleMilliseconds = 1000)
  {
    #if DEBUG
    lock (LockObject) {
      HashSet<ReentrancyInfo> notes = RecursionChecker.ReentrancyNotes;
      foreach (ReentrancyInfo note in notes) {
        if (note.HandlePotentiallyRentrantCall(target, maxOK)) {
          break;
        }
      }
      ReentrancyInfo newNote = new ReentrancyInfo(target, staleMilliseconds);
      newNote.HandlePotentiallyRentrantCall(target, maxOK);
      RecursionChecker.CleanUp(notes);
      notes.Add(newNote);
    }
    #endif
  }
}

helper classes below:

internal class ReentrancyInfo
{
  public WeakReference<object> ReentrantObject { get; set;}
  public object GetReentrantObject() {
    return this.ReentrantObject?.TryGetTarget();
  }
  public DateTime LastCall { get; set;}
  public int StaleMilliseconds { get; set;}
  public int ReentrancyCount { get; set;}
  public bool IsDeadOrStale() {
    bool r = false;
    if (this.LastCall.MillisecondsBeforeNow() > this.StaleMilliseconds) {
      r = true;
    } else if (this.GetReentrantObject() == null) {
      r = true;
    }
    return r;
  }
  public ReentrancyInfo(object reentrantObject, int staleMilliseconds = 1000)
  {
    this.ReentrantObject = new WeakReference<object>(reentrantObject);
    this.StaleMilliseconds = staleMilliseconds;
    this.LastCall = DateTime.Now;
  }
  public bool HandlePotentiallyRentrantCall(object target, int maxOK) {
    bool r = false;
    object myTarget = this.GetReentrantObject();
    if (target.DoesEqual(myTarget)) {
      DateTime last = this.LastCall;
      int ms = last.MillisecondsBeforeNow();
      if (ms > this.StaleMilliseconds) {
        this.ReentrancyCount = 1;
      }
      else {
        if (this.ReentrancyCount == maxOK) {
          throw new Exception("Probable infinite recursion");
        }
        this.ReentrancyCount++;
      }
    }
    this.LastCall = DateTime.Now;
    return r;
  }
}

public static class DateTimeAdditions
{
  public static int MillisecondsBeforeNow(this DateTime time) {
    DateTime now = DateTime.Now;
    TimeSpan elapsed = now.Subtract(time);
    int r;
    double totalMS = elapsed.TotalMilliseconds;
    if (totalMS > int.MaxValue) {
      r = int.MaxValue;
    } else {
      r = (int)totalMS;
    }
    return r;
  }
}

public static class WeakReferenceAdditions {
  /// <summary> returns null if target is not available. </summary>
  public static TTarget TryGetTarget<TTarget> (this WeakReference<TTarget> reference) where TTarget: class 
  {
    TTarget r = null;
    if (reference != null) {
      reference.TryGetTarget(out r);
    }
    return r;
  }
}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow