CPU instruction reordering

Question 1

Your program is very different from the one in the article that you cited from preshing.com. The preshing.com program uses semaphores where yours uses mutexes.

Mutexes are simpler than semaphores. They only make one guarantee--that only one thread at a time can lock the mutex. That is to say, they can only be used for mutual exclusion.

The preshing.com program does something with its semaphores that you can't do with mutexes alone: It synchronizes the loops in the three threads so that they all proceed in lock-step. Thread1 and Thread2 each wait at the top of their loop until main() lets them go, and then main waits at the bottom of its loop until they have completed their work. Then they all go 'round again.

You can't do that with mutexes. In your program, what prevents main from going around its loop thousands of times before either of the other two threads gets to run at all? Nothing but chance. Nor does anything prevent Thread1 and/or Thread2 from looping thousands of times while main() is blocked, waiting for its next time slice.

Remember, a semaphore is a counter. Look carefully at how the semaphores in the preshing.com program are incremented and decremented by the threads, and you will see how they keep the threads synchronized.
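To make the counter idea concrete, here is a minimal sketch of that lock-step pattern. It is not the preshing.com code: the `Sem` class, the `lockstep_rounds` function and the `begin1`/`begin2`/`done` names are all made up for illustration, and the semaphore is assumed to be built from a mutex and a condition variable.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical minimal counting semaphore (illustration only).
class Sem {
    std::mutex m;
    std::condition_variable cv;
    int count = 0;
public:
    void post() {                       // increment the counter, wake one waiter
        std::lock_guard<std::mutex> lk(m);
        ++count;
        cv.notify_one();
    }
    void wait() {                       // block until count > 0, then decrement
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return count > 0; });
        --count;
    }
};

// Runs `rounds` lock-step iterations with two workers and returns how many
// rounds ended with each worker having run exactly once.
int lockstep_rounds(int rounds) {
    Sem begin1, begin2, done;
    int work1 = 0, work2 = 0;           // each is written by one worker only

    auto worker = [rounds](Sem& begin, Sem& finished, int& work) {
        for (int i = 0; i < rounds; ++i) {
            begin.wait();               // top of loop: wait until main lets us go
            ++work;
            finished.post();            // tell main this round's work is done
        }
    };
    std::thread t1(worker, std::ref(begin1), std::ref(done), std::ref(work1));
    std::thread t2(worker, std::ref(begin2), std::ref(done), std::ref(work2));

    int in_step = 0;
    for (int i = 0; i < rounds; ++i) {
        begin1.post();                  // release both workers for this round
        begin2.post();
        done.wait();                    // bottom of loop: wait for both workers
        done.wait();
        if (work1 == i + 1 && work2 == i + 1) ++in_step;
    }
    t1.join();
    t2.join();
    return in_step;
}
```

Calling `lockstep_rounds(1000)` should report that every round stayed in step, because neither worker can start round i+1 until main has posted its `begin` semaphore again. A mutex has no counter to post, so it cannot express "you may run exactly once more".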

Question 2

My mistake was using mutexes instead of semaphores (thanks, james large). This is the working code:

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>
using namespace std;

class semaphore{
private:
    mutex mtx;
    condition_variable cv;
    int cnt;

public:
    semaphore(int count = 0):cnt(count){}
    void notify()
    {
        unique_lock<mutex> lck(mtx);
        ++cnt;
        cv.notify_one();
    }
    void wait()
    {
        unique_lock<mutex> lck(mtx);

        while(cnt == 0){
            cv.wait(lck);
        }
        --cnt;
    }
};

int a,b;
int r1,r2;
semaphore s1,s2,s3;

void th1()
{
    for(;;)
    {
        s1.wait();
        a=1;
        asm volatile("" ::: "memory"); // compiler barrier only: the CPU may still reorder the store and the load
        r1=b;
        s3.notify();
    }
}

void th2()
{
    for(;;)
    {
        s2.wait();
        b=1;
        asm volatile("" ::: "memory"); // compiler barrier only: the CPU may still reorder the store and the load
        r2=a;
        s3.notify();
    }
}

int main()
{
    int cnt{0};
    thread thread1{th1};
    thread thread2{th2};
    thread1.detach();
    thread2.detach();
    for(int i=0;i<100000;i++)
    {
        a=b=0;
        s1.notify();
        s2.notify();
        s3.wait();
        s3.wait();

        if(r1==0&&r2==0)
        {
            ++cnt;
        }
    }
    cout<<cnt<<" CPU reorders happened!\n";
}

The reordering seems to be properly reproduced.
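One way to check that the count really comes from the CPU and not from the compiler is to swap the compiler barrier for a full memory fence and see whether the count drops to zero. The following is only a sketch under my own assumptions: it uses `std::atomic<int>` with `std::atomic_thread_fence(std::memory_order_seq_cst)` instead of the inline asm, and `Sem` and `count_reorders_with_fence` are names invented for this example.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical minimal counting semaphore (illustration only).
class Sem {
    std::mutex m;
    std::condition_variable cv;
    int count = 0;
public:
    void post() {
        std::lock_guard<std::mutex> lk(m);
        ++count;
        cv.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return count > 0; });
        --count;
    }
};

// Same experiment as above, but with a sequentially consistent fence between
// the store and the load. Returns how often both threads read 0.
int count_reorders_with_fence(int iterations) {
    std::atomic<int> a{0}, b{0};
    int r1 = 0, r2 = 0;
    Sem begin1, begin2, done;

    auto worker = [iterations](Sem& begin, Sem& finished,
                               std::atomic<int>& mine,
                               std::atomic<int>& other, int& result) {
        for (int i = 0; i < iterations; ++i) {
            begin.wait();
            mine.store(1, std::memory_order_relaxed);
            // Full fence: the store above must be ordered before the load below.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            result = other.load(std::memory_order_relaxed);
            finished.post();
        }
    };
    std::thread t1(worker, std::ref(begin1), std::ref(done),
                   std::ref(a), std::ref(b), std::ref(r1));
    std::thread t2(worker, std::ref(begin2), std::ref(done),
                   std::ref(b), std::ref(a), std::ref(r2));

    int reorders = 0;
    for (int i = 0; i < iterations; ++i) {
        a.store(0);
        b.store(0);
        begin1.post();
        begin2.post();
        done.wait();
        done.wait();
        if (r1 == 0 && r2 == 0) ++reorders;
    }
    t1.join();
    t2.join();
    return reorders;
}
```

With the fence in both threads, the C++ memory model rules out the case where both loads miss the other thread's store, so the function should always return 0. If the original program reports a nonzero count and this variant does not, the difference is the store/load reordering done by the hardware.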