Pergunta

I have a problem in a production service which contains a "watchdog" timer used to check whether the main processing job has become frozen (this is related to a COM interop problem which unfortunately can't be reproduced in test).

Here's how it currently works:

  • During processing, the main thread resets a ManualResetEvent, processes a single item (this shouldn't take long), then sets the event. It then continues to process any remaining items.
  • Every 5 minutes, the watchdog calls WaitOne(TimeSpan.FromMinutes(5)) on this event. If the result is false, the service is restarted.
  • Sometimes, during normal operation, the service is being restarted by this watchdog even though processing takes nowhere near 5 minutes.

The cause appears to be that when multiple items await processing, the time between the Set() after the first item is processed, and the Reset() before the second item is processed is too brief, and WaitOne() doesn't appear to recognise that the event has been set.

My understanding of WaitOne() is that the blocked thread is guaranteed to receive a signal when Set() is called, but I assume I'm missing something important.

Note that if I allow a context switch by calling Thread.Sleep(0) after calling Set(), WaitOne() never fails.

Included below is a sample which produces the same behaviour as my production code. WaitOne() sometimes waits for 5 seconds and fails, even though Set() is being called every 800 milliseconds.

private static ManualResetEvent _handle;

private static void Main(string[] args)
{
    _handle = new ManualResetEvent(true);

    ((Action) PeriodicWait).BeginInvoke(null, null);
    ((Action) PeriodicSignal).BeginInvoke(null, null);

    Console.ReadLine();
}

private static void PeriodicWait()
{
    Stopwatch stopwatch = new Stopwatch();

    while (true)
    {
        stopwatch.Restart();
        bool result = _handle.WaitOne(5000, false);
        stopwatch.Stop();
        Console.WriteLine("After WaitOne: {0}. Waited for {1}ms", result ? "success" : "failure",
                            stopwatch.ElapsedMilliseconds);
        SpinWait.SpinUntil(() => false, 1000);
    }
}

private static void PeriodicSignal()
{
    while (true)
    {
        _handle.Reset();
        Console.WriteLine("After Reset");
        SpinWait.SpinUntil(() => false, 800);
        _handle.Set();
        // Uncommenting either of the lines below prevents the problem
        //Console.WriteLine("After Set");
        //Thread.Sleep(0);
    }
}

output of the above code


The Question

While I understand that calling Set() closely followed by Reset() doesn't guarantee that all blocked threads will resume, is it also not guaranteed that any waiting threads will be released?

Foi útil?

Solução

No, this is fundamentally broken code. There are only reasonable odds that the WaitOne() will complete when you keep the MRE set for such a short amount of time. Windows favors releasing a thread that's blocked on an event. But this will drastically fail when the thread isn't waiting. Or the scheduler picks another thread instead, one that runs with a higher priority and also got unblocked. Could be a kernel thread for example. MRE doesn't keep a "memory" of having been signaled and not yet waited on.

Neither Sleep(0) or Sleep(1) are good enough to guarantee that the wait is going to complete, there's no reasonable upper bound on how often the waiting thread could be bypassed by the scheduler. Although you probably ought to shut down the program when it takes longer than 10 seconds ;)

You'll need to do this differently. A simple way is to rely on the worker to eventually set the event. So reset it before you start waiting:

private static void PeriodicWait() {
    Stopwatch stopwatch = new Stopwatch();

    while (true) {
        stopwatch.Restart();
        _handle.Reset();
        bool result = _handle.WaitOne(5000);
        stopwatch.Stop();
        Console.WriteLine("After WaitOne: {0}. Waited for {1}ms", result ? "success" : "failure",
                            stopwatch.ElapsedMilliseconds);
    }
}

private static void PeriodicSignal() {
    while (true) {
        _handle.Set();
        Thread.Sleep(800);   // Simulate work
    }
}

Outras dicas

You can't "pulse" an OS event like this.

Among other issues, there's the fact that any OS thread performing a blocking wait on an OS handle can be temporarily interrupted by a kernel-mode APC; when the APC finishes, the thread resumes waiting. If the pulse happened during that interruption, the thread doesn't see it. This is just one example of how "pulses" can be missed (described in detail in Concurrent Programming on Windows, page 231).

BTW, this does mean that the PulseEvent Win32 API is completely broken.

In a .NET environment with managed threads, there's even more possibility of missing a pulse. Garbage collection, etc.

In your case, I would consider switching to an AutoResetEvent which is repeatedly Set by the working process and (automatically) reset by the watchdog process each time its Wait completes. And you'd probably want to "tame" the watchdog by only having it check every minute or so.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top