Question

We have some code which creates a number of BackgroundWorker threads, each of which does some database stuff. Sometimes those threads throw an exception (usually due to a timeout -- which is a recent thing, and I'm not the guy who has to figure that one out).

If any threads fail, the whole operation is useless, and the whole thing is happening in a web service call. So on failure, we need to throw an exception in the main thread, which will be caught and converted into a SOAP fault exception for the client.

We collect the thread exceptions in a List. Out of dozens of times this code's been exercised with as many as seven worker threads all throwing exceptions around the same time, on one occasion List threw an exception in System.Collections.Generic.List`1.Add(T item):

System.IndexOutOfRangeException

Message: Index was outside the bounds of the array.

Here's, roughly, the code:

//  Collect Exceptions thrown by async calls. 
var exAsync = new List<Exception>();
int ctThreadsFinished = 0;
int ctThreadsBegun = 0;

Action<Exception> handleException = (ex) => {
    lock(exAsync) {
        ++ctThreadsFinished;
        exAsync.Add(ex);
    }
};

//  ...create and run multiple BackgroundWorker threads, incrementing 
//  ctThreadsBegun for each thread. They will ++ctThreadsFinished on 
//  successful completion. That part works. 

//  If a thread throws an exception, its RunWorkerCompleted event will pass the
//  exception to handleException.

while (ctThreadsFinished < ctThreadsBegun)
{
    System.Threading.Thread.Sleep(100);
}

if (exAsync.Count == 1)
{
    throw new Exception(exAsync.First().Message, exAsync.First()); 
}
else if (exAsync.Count > 1)
{
    var msg = String.Join("\n", exAsync.Select(ex => ex.Message));
    throw new AggregateException(msg, exAsync);
}

I put the lock on it because I had assumed that RunWorkerCompleted was called in the worker thread (which normally it isn't, but this is web service, and it looks like behavior outside a Windows application will differ).

The exception looks like List.Add is called by thread 1, then called by thread 2 while the first call is still executing and the object is still in an inconsistent state. Since multiple failures are always (in practice, so far) due to multiple threads hitting the default 30-second SqlCommand timeout, they'll be doing it around the same time. And I can recreate exactly that behavior in a little test app, if there is no lock on the list.

Could it be that it's incrementing ctThreadsFinished before the Add at just the right moment to get past the wait loop, so it's accessing exAsync.Count or exAsync.First() during an Add() call? Could that break Add()? It certainly would have been wise to have a shared lock object and put locks around the counter access in the wait loop, and the bit at the end.

However, even if everything accessing exAsync isn't actually doing so in the main thread, there's a lock() block around the Add() call. My first impulse was to replace List with System.Collections.Concurrent.ConcurrentBag, but I have no particular reason to believe that'll address the issue.

Does this make any sense to anybody?

Was it helpful?

Solution

Just locking around Add won't fix the issue; that just makes sure that two different Add calls do not interfere with each other. The race condition you identified with the wait loop finishing before Add being called is valid and would cause the problem you're seeing. You should also lock around the whole if/else block that's examining exAsync.

You shouldn't just replace the list with a ConcurrentBag as you might get a different problem: reading from the bag before the last exception is inserted into the list.

(edit) I would also use ManualResetEventSlim to block the thread rather than the sleep loop. You can have your main thread wait on it and the last worker thread signal it when the count gets to 0.

Also it is good practice to create a private object and lock on that, rather than the list itself. That way you can be explicit about what you're synchronizing on.

OTHER TIPS

The problem is with the way the lock statement is used. A quote from this post:

Finally, there is the common misconception that lock(this) actually modifies the object passed as a parameter, and in some way makes it read-only or inaccessible. This is false. The object passed as a parameter to lock merely serves as a key. If a lock is already being held on that key, the lock cannot be made; otherwise, the lock is allowed.

"Locking" your list will not prevent other code from accessing that object. It merely says that nobody else can create a lock using the list as a key. A ConcurrentBag should fix your exception, but if your throw exception code is getting hit before your last handle is finished adding an exception to the list it would introduce the possibility you would miss the last exception.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top