Question

I have two versions of a spinlock below. The first uses the default memory order, which is memory_order_seq_cst, while the second uses memory_order_acquire/memory_order_release. Since the latter is more relaxed, I expected it to have better performance. However, that doesn't seem to be the case.

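(The snippet below leaves out the headers and the DoWaitBackoff helper. To make it self-contained, something along these lines should work; the backoff body is only a plausible sketch, since the original does not show it.)

#include <atomic>
#include <cassert>
#include <iostream>
#include <thread>
#include <vector>

using namespace std;

// Plausible stand-in for the backoff helper (the real body is not shown):
// spin a little longer after each failed attempt to take the lock.
inline void DoWaitBackoff(int& backoff)
{
    backoff = (backoff == 0) ? 1 : backoff * 2;    // exponential backoff
    if (backoff > 1024) backoff = 1024;            // cap the wait
    for (volatile int i = 0; i < backoff; ++i) {}  // volatile keeps the busy-wait alive
}
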
class SimpleSpinLock
{
public:

    inline SimpleSpinLock(): mFlag(ATOMIC_FLAG_INIT) {}

    inline void lock()
    {   
        int backoff = 0;
        while (mFlag.test_and_set()) { DoWaitBackoff(backoff); }
    }   

    inline void unlock()
    {   
        mFlag.clear();
    }   

private:

    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};

class SimpleSpinLock2
{
public:

    inline SimpleSpinLock2(): mFlag(ATOMIC_FLAG_INIT) {}

    inline void lock()
    {   
        int backoff = 0;
        while (mFlag.test_and_set(std::memory_order_acquire)) { DoWaitBackoff(backoff); }
    }   

    inline void unlock()
    {   
        mFlag.clear(std::memory_order_release);
    }   

private:

    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};

const int NUM_THREADS = 8;
const int NUM_ITERS = 5000000;

const int EXPECTED_VAL = NUM_THREADS * NUM_ITERS;

int val = 0;
long j = 0;

SimpleSpinLock spinLock;

void ThreadBody()
{
    for (int i = 0; i < NUM_ITERS; ++i)
    {   
        spinLock.lock();

        ++val; 

        j = i * 3.5 + val;  

        spinLock.unlock();
    }   
}

int main()
{
    vector<thread> threads;

    for (int i = 0; i < NUM_THREADS; ++i)
    {   
        cout << "Creating thread " << i << endl; 
        threads.push_back(std::move(std::thread(ThreadBody)));
    }   

    for (thread& thr: threads)
    {   
        thr.join();
    }   

    cout << "Final value: " << val << "\t" << j << endl;
    assert(val == EXPECTED_VAL);

    return 1;  
}

I am running on Ubuntu 12.04 with gcc 4.8.2, compiling at optimization level -O3.

-- Spinlock with memory_order_seq_cst:

Run 1:
real    0m1.588s
user    0m4.548s
sys 0m0.052s

Run 2:
real    0m1.577s
user    0m4.580s
sys 0m0.032s

Run 3:
real    0m1.560s
user    0m4.436s
sys 0m0.032s

-- Spinlock with memory_order_acquire/release:

Run 1:

real    0m1.797s
user    0m4.608s
sys 0m0.100s

Run 2:

real    0m1.853s
user    0m4.692s
sys 0m0.164s

Run 3:
real    0m1.784s
user    0m4.552s
sys 0m0.124s

Run 4:
real    0m1.475s
user    0m3.596s
sys 0m0.120s

With the more relaxed model, I see a lot more variability. Sometimes it's better, but often it is worse. Does anyone have an explanation for this?


Solution

The generated unlock code is different. The seq_cst memory model (with g++ 4.9.0) generates:

    movb    %sil, spinLock(%rip)
    mfence

for the unlock. The acquire/release generates:

    movb    %sil, spinLock(%rip)

The lock code is the same. Someone else will have to say something about why it's better with the fence, but if I had to guess, it reduces bus/cache-coherence contention, possibly by reducing interference on the bus. Sometimes stricter is more orderly, and thus faster.

ADDENDUM: According to this, mfence costs around 100 cycles. So maybe you are reducing bus contention, because when a thread finishes the loop body, it pauses a bit before trying to reacquire the lock, letting the other thread finish. You could try to do the same thing by putting in a short delay loop after the unlock, though you'd have to make sure that it didn't get optimized out.

ADDENDUM2: It does seem to be bus interference/contention caused by looping around too fast. I added a short delay loop like this:

    spinLock.unlock();
    for (int i = 0; i < 5; i++) {
        j = i * 3.5 + val;
    }

Now the acquire/release version performs the same as the seq_cst one.
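
(If you do add a deliberate delay after the unlock, one way to keep the compiler from optimizing it away is a compiler barrier. The helper below is only an illustration, not code from the original answer, and it assumes an x86 target, where the pause instruction also hints that this is a spin-style wait:)

    // Illustrative only: a short delay the optimizer cannot elide.
    inline void ShortDelay(int spins = 5)
    {
        for (int i = 0; i < spins; ++i)
            __asm__ __volatile__("pause" ::: "memory");  // compiler barrier + x86 pause hint
    }

    // usage after releasing the lock:
    spinLock.unlock();
    ShortDelay();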

Licensed under: CC-BY-SA with attribution