Question

I have two versions of a spinlock below. The first uses the default memory order, which is memory_order_seq_cst, while the second uses memory_order_acquire/memory_order_release. Since the latter is more relaxed, I expected it to have better performance. However, that doesn't seem to be the case.

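(The snippet below leaves out the headers and the DoWaitBackoff helper. To make it self-contained, something along these lines should work; the backoff body is only a plausible sketch, since the original does not show it.)

#include <atomic>
#include <cassert>
#include <iostream>
#include <thread>
#include <vector>

using namespace std;

// Plausible stand-in for the backoff helper (the real body is not shown):
// spin a little longer after each failed attempt to take the lock.
inline void DoWaitBackoff(int& backoff)
{
    backoff = (backoff == 0) ? 1 : backoff * 2;    // exponential backoff
    if (backoff > 1024) backoff = 1024;            // cap the wait
    for (volatile int i = 0; i < backoff; ++i) {}  // volatile keeps the busy-wait alive
}
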
class SimpleSpinLock
{
public:

    inline SimpleSpinLock(): mFlag(ATOMIC_FLAG_INIT) {}

    inline void lock()
    {   
        int backoff = 0;
        while (mFlag.test_and_set()) { DoWaitBackoff(backoff); }
    }   

    inline void unlock()
    {   
        mFlag.clear();
    }   

private:

    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};

class SimpleSpinLock2
{
public:

    inline SimpleSpinLock2(): mFlag(ATOMIC_FLAG_INIT) {}

    inline void lock()
    {   
        int backoff = 0;
        while (mFlag.test_and_set(std::memory_order_acquire)) { DoWaitBackoff(backoff); }
    }   

    inline void unlock()
    {   
        mFlag.clear(std::memory_order_release);
    }   

private:

    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};

const int NUM_THREADS = 8;
const int NUM_ITERS = 5000000;

const int EXPECTED_VAL = NUM_THREADS * NUM_ITERS;

int val = 0;
long j = 0;

SimpleSpinLock spinLock;

void ThreadBody()
{
    for (int i = 0; i < NUM_ITERS; ++i)
    {   
        spinLock.lock();

        ++val; 

        j = i * 3.5 + val;  

        spinLock.unlock();
    }   
}

int main()
{
    vector<thread> threads;

    for (int i = 0; i < NUM_THREADS; ++i)
    {   
        cout << "Creating thread " << i << endl; 
        threads.push_back(std::move(std::thread(ThreadBody)));
    }   

    for (thread& thr: threads)
    {   
        thr.join();
    }   

    cout << "Final value: " << val << "\t" << j << endl;
    assert(val == EXPECTED_VAL);

    return 1;  
}

I am running on Ubuntu 12.04 with gcc 4.8.2, compiling at optimization level -O3.

-- Spinlock with memory_order_seq_cst:

Run 1:
real    0m1.588s
user    0m4.548s
sys 0m0.052s

Run 2:
real    0m1.577s
user    0m4.580s
sys 0m0.032s

Run 3:
real    0m1.560s
user    0m4.436s
sys 0m0.032s

-- Spinlock with memory_order_acquire/release:

Run 1:

real    0m1.797s
user    0m4.608s
sys 0m0.100s

Run 2:

real    0m1.853s
user    0m4.692s
sys 0m0.164s

Run 3:
real    0m1.784s
user    0m4.552s
sys 0m0.124s

Run 4:
real    0m1.475s
user    0m3.596s
sys 0m0.120s

With the more relaxed model, I see a lot more variability. Sometimes it's better, but often it is worse. Does anyone have an explanation for this?


Solution

The generated unlock code is different. The seq_cst memory model (with g++ 4.9.0) generates:

    movb    %sil, spinLock(%rip)
    mfence

for the unlock. The acquire/release generates:

    movb    %sil, spinLock(%rip)

The lock code is the same. Someone else will have to say something about why it's better with the fence, but if I had to guess, it reduces bus/cache-coherence contention, possibly by reducing interference on the bus. Sometimes stricter is more orderly, and thus faster.

ADDENDUM: According to this, mfence costs around 100 cycles. So maybe you are reducing bus contention, because when a thread finishes the loop body, it pauses a bit before trying to reacquire the lock, letting the other thread finish. You could try to do the same thing by putting in a short delay loop after the unlock, though you'd have to make sure that it didn't get optimized out.

ADDENDUM2: It does seem to be bus interference/contention caused by looping around too fast. I added a short delay loop like this:

    spinLock.unlock();
    for (int i = 0; i < 5; i++) {
        j = i * 3.5 + val;
    }

Now the acquire/release version performs the same as the seq_cst one.
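
(If you do add a deliberate delay after the unlock, one way to keep the compiler from optimizing it away is a compiler barrier. The helper below is only an illustration, not code from the original answer, and it assumes an x86 target, where the pause instruction also hints that this is a spin-style wait:)

    // Illustrative only: a short delay the optimizer cannot elide.
    inline void ShortDelay(int spins = 5)
    {
        for (int i = 0; i < spins; ++i)
            __asm__ __volatile__("pause" ::: "memory");  // compiler barrier + x86 pause hint
    }

    // usage after releasing the lock:
    spinLock.unlock();
    ShortDelay();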

Licensed under: CC-BY-SA with attribution