Question

A very simple approach. Two threads.

volatile LONG p = 0;  // LONG matches the parameter type of the Interlocked* functions

One thread (A) only uses

while(1){
    ExecuteAVeryCPUIntesiveThing();
    InterlockedExchange(&p, 0);
}

The other thread (B) uses

while(1){
    if(0==InterlockedCompareExchange(&p,0,0))
        InterlockedExchange(&p, 1);
}

If you log this on a system under stress (heavy memory swapping, I/O, socket traffic, CPU spikes, ...), the values written by A do not get propagated to B.

In A, p appears to have the value 0, but from B's perspective p is stuck at 1. As I understand it, when A sets the value to 0, B should detect that and set it back to 1.

This works as expected on real hardware and on some virtual systems, but not under VMware (ESXi).

Have I made a mental slip, or ...?

Guest OS: Windows Server 2008

code compiled with Microsoft (R) C/C++ Optimizing Compiler Version 15.00.30729.01 for x64

Host: ESXi 4.1

Update:

Response to comments: yes, p will bounce between 0 and 1. But as written, thread B does not keep bouncing, because from B's perspective the value of p never, or only rarely, changes; it bounces 10-20 times and then stops.

I would like to execute the code block in A (ExecuteAVeryCPUIntesiveThing();) only at very precise moments.

The production code has many more threads, plus events, mutexes and locks, but the fact remains: if I strip it down to just the code above, I can still reproduce the problem by generating a lot of CPU, memory and I/O load on the guest OS.


Solution

This code is a threading race waiting to happen. You probably see it under VMware because you allocated only one processor to the virtual machine.

What the code is missing is an interlock that ensures thread B has seen the value of p change. If thread A acquires a CPU core and keeps running for a while, while thread B is blocked waiting for a quantum, thread A can set p to 0 more than once. Thread B is oblivious to that, since it never got a chance to set p back to 1.
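One way such an interlock could look is a two-way handshake, where A only runs after it has seen B's request and B only raises a new request after A has acknowledged the previous one. The sketch below is an illustration, not the asker's code: it assumes a C++11 compiler and uses std::atomic in place of the Win32 Interlocked* calls, and DoWork, work_runs and RunHandshake are hypothetical names introduced here.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for ExecuteAVeryCPUIntesiveThing(); it only counts calls.
static std::atomic<int> work_runs{0};
static void DoWork() { work_runs.fetch_add(1); }

// Runs the handshake for `requests` rounds and returns how many times
// the worker (thread A) executed DoWork().
int RunHandshake(int requests) {
    assert(requests >= 0);
    std::atomic<int> p{0};  // 0 = idle / acknowledged, 1 = request pending
    work_runs.store(0);

    std::thread a([&] {  // thread A: worker
        for (int done = 0; done < requests; ++done) {
            // Wait until B has actually raised a request.
            while (p.load(std::memory_order_acquire) != 1)
                std::this_thread::yield();
            DoWork();
            // Acknowledge exactly once per request.
            p.store(0, std::memory_order_release);
        }
    });

    std::thread b([&] {  // thread B: requester
        for (int sent = 0; sent < requests; ++sent) {
            int expected = 0;
            // Raise a new request only after the previous one was acknowledged.
            while (!p.compare_exchange_weak(expected, 1,
                                            std::memory_order_acq_rel)) {
                expected = 0;  // a failed exchange overwrites `expected`
                std::this_thread::yield();
            }
        }
    });

    a.join();
    b.join();
    return work_runs.load();
}
```

Calling RunHandshake(1000) should run the work exactly 1000 times, however the scheduler interleaves the two threads, because neither side can overwrite a flag value the other has not yet consumed. The spin-and-yield waits are only for brevity; blocking on an event or condition variable would avoid burning CPU.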

You'll need to rethink your lock design. The problem isn't limited to VMware; it can go wrong on a regular machine as well, just a lot less often: once a month, give or take, which makes it impossible to debug. This is otherwise a classic producer/consumer scenario, one you address with a thread-safe queue.
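As a rough illustration of the queue approach, here is a minimal blocking queue built on a mutex and a condition variable. It assumes a C++11 standard library rather than the Win32 primitives used in the question, and BlockingQueue, RunProducerConsumer and the -1 stop sentinel are hypothetical choices made for this sketch.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical minimal thread-safe queue; name and interface are illustrative.
template <typename T>
class BlockingQueue {
public:
    void Push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        ready_.notify_one();  // wake one waiting consumer
    }

    // Blocks until an item is available, then removes and returns it.
    T Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
};

// The producer (thread B's role) posts `count` requests followed by a -1
// sentinel; the consumer (thread A's role) returns how many it handled.
int RunProducerConsumer(int count) {
    assert(count >= 0);
    BlockingQueue<int> queue;
    int handled = 0;

    std::thread consumer([&] {
        for (;;) {
            int request = queue.Pop();
            if (request < 0) break;  // sentinel: stop
            ++handled;               // the CPU-intensive work would go here
        }
    });

    for (int i = 0; i < count; ++i) queue.Push(i);
    queue.Push(-1);
    consumer.join();  // join makes `handled` safe to read afterwards
    return handled;
}
```

Because every request is a separate queue entry, none can be lost if the consumer is starved of CPU for a while; the requests simply accumulate until it runs again.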

Licensed under: CC-BY-SA with attribution