Question

I'm writing a multithreaded application in C++ where performance is critical. I need to use a lot of locking while copying small structures between threads, and for this I have chosen to use spinlocks.

I have done some research and speed testing on this, and I found that most implementations are roughly equally fast:

  • Microsoft's CRITICAL_SECTION, with SpinCount set to 1000, scores about 140 time units
  • Implementing this algorithm with Microsoft's InterlockedCompareExchange scores about 95 time units
  • I've also tried some inline assembly with __asm {}, using something like this code, and it scores about 70 time units, but I am not sure that a proper memory barrier has been created.

Edit: The times given here are the time it takes for 2 threads to lock and unlock the spinlock 1,000,000 times.

I know this isn't a big difference, but since a spinlock is such a heavily used object, one would think programmers would have agreed on the fastest possible way to make one. Googling it leads to many different approaches, however. I would think the aforementioned method would be the fastest if implemented in inline assembly, using the instruction CMPXCHG8B instead of comparing 32-bit registers. Furthermore, memory barriers must be taken into account; this could be done with LOCK CMPXCHG8B (I think?), which guarantees "exclusive rights" to the shared memory between cores. Finally, some suggest that busy waits should be accompanied by REP NOP, which would let Hyper-Threading processors switch to another thread, but I am not sure whether this is true or not.

From my performance test of different spinlocks it is clear that there is not much difference, but for purely academic purposes I would like to know which one is fastest. However, as I have extremely limited experience with assembly language and memory barriers, I would be happy if someone could write the assembly code for the last example I provided, with LOCK CMPXCHG8B and proper memory barriers, in the following template:

__asm
{
     spin_lock:
         ;locking code.
     spin_unlock:
         ;unlocking code.
}

Solution

Just look here: x86 spinlock using cmpxchg

And thanks to Cory Nelson

__asm{
spin_lock:
xorl %ecx, %ecx
incl %ecx
spin_lock_retry:
xorl %eax, %eax
lock; cmpxchgl %ecx, (lock_addr)
jnz spin_lock_retry
ret

spin_unlock:
movl $0, (lock_addr)
ret
}

And another source says: http://www.geoffchappell.com/studies/windows/km/cpu/cx8.htm

       lock    cmpxchg8b qword ptr [esi]
is replaceable with the following sequence

try:
        lock    bts dword ptr [edi],0
        jnb     acquired
wait:
        test    dword ptr [edi],1
        je      try
        pause                   ; if available
        jmp     wait

acquired:
        cmp     eax,[esi]
        jne     fail
        cmp     edx,[esi+4]
        je      exchange

fail:
        mov     eax,[esi]
        mov     edx,[esi+4]
        jmp     done

exchange:
        mov     [esi],ebx
        mov     [esi+4],ecx

done:
        mov     byte ptr [edi],0

And here is a discussion about lock-free vs lock implementations: http://newsgroups.derkeiler.com/Archive/Comp/comp.programming.threads/2011-10/msg00009.html

Other tips

Although there is already an accepted answer, there are a few things that were missed and that could be used to improve all the answers, taken from this Intel article on fast lock implementation:

  1. Spin on a volatile read, not on an atomic instruction; this avoids unneeded bus locking, especially on highly contended locks.
  2. Use back-off for highly contended locks.
  3. Inline the lock, preferably with intrinsics for compilers where inline asm is detrimental (basically MSVC).

I'm usually not one to gripe about someone striving to achieve fast code: it's usually a very good exercise which results in better understanding of programming and faster code.

I won't gripe here either, but I can state unequivocally that the question of whether a fast spinlock is 3 instructions long or a few more is - at least on the x86 architecture - a futile chase.

Here's why:

Invoking a spinlock with a typical code sequence

lock_variable DW 0    ; 0 <=> free

mov ebx,offset lock_variable
mov eax,1
xchg eax,[ebx]

; if eax contains 0 (no one owned it) you own the lock,
; if eax contains 1 (someone already does) you don't

Freeing a spinlock is trivial

mov ebx,offset lock_variable
mov dword ptr [ebx],0

The xchg instruction raises the lock pin on the processor, which in effect means "I want the bus during the next few clock cycles." This signal weaves its way through the caches and down to the slowest bus-mastering device, which is usually the PCI bus. When every bus-mastering device has completed, the locka (lock acknowledge) signal is sent back. Only then does the actual exchange take place. The problem is that the lock/locka sequence takes a VERY long time: the PCI bus may run at 33MHz with several cycles of latency, and on a 3.3 GHz CPU that means each PCI bus cycle takes a hundred CPU cycles.

As a rule of thumb I assume that a lock will take between 300 and 3000 CPU cycles to complete, and in the end I don't know if I'll even own the lock. So the few cycles you can save with a "fast" spinlock will be a mirage, because no two locks behave alike: the time depends on your bus situation during that short window.

________________EDIT________________

I just read that the spinlock is a "heavily used object." Well, you obviously don't understand that a spinlock consumes an enormous amount of CPU cycles every time it is invoked. Or, to put it another way, every time you invoke it you lose a significant amount of your processing capability.

The trick when using spinlocks (or their bigger sibling, the critical section) is to use them as sparingly as possible while still achieving the intended program function. Using them all over the place is easy and you'll end up with lackluster performance as a result.

It's not all about writing fast code; it's also about organizing your data. When you write "copying small structures between threads," you should realize that the lock may take hundreds of times longer to complete than the actual copy.

________________EDIT________________

When you compute an average lock time it will probably say very little as it is measured on your machine which may not be the intended target (which may have entirely different bus usage characteristics). For your machine the average will be made up of individual very fast times (when the bus-mastering activities didn't interfere) all the way up to very slow times (when bus-mastering interference was significant).

You could introduce code which determines the fastest and slowest cases and calculate the quotient to see just how greatly the spinlock times can vary.

________________EDIT________________

May 2016 update.

Peter Cordes promoted the idea that "tuning a lock in the un-contended case can make sense" and that lock times of many hundreds of clock cycles do not occur on modern CPUs except in situations where the lock variable is misaligned. I began wondering if my previous test program - written in 32-bit Watcom C - might be hampered by WOW64 as it was running on a 64-bit OS: Windows 7.

So I wrote a 64-bit program and compiled it with TDM's gcc 5.3. The program utilizes the implicitly bus-locking instruction variant "XCHG r,m" for locking and a simple assignment "MOV m,r" for unlocking. In some lock variants the lock variable was pre-tested to determine if it was feasible to even attempt a lock (using a simple compare "CMP r,m", probably not venturing outside L3). Here it is:

// compiler flags used:

// -O1 -m64 -mthreads -mtune=k8 -march=k8 -fwhole-program -freorder-blocks -fschedule-insns -falign-functions=32 -g3 -Wall -c -fmessage-length=0

#define CLASSIC_BUS_LOCK
#define WHILE_PRETEST
//#define SINGLE_THREAD

typedef unsigned char      u1;
typedef unsigned short     u2;
typedef unsigned long      u4;
typedef unsigned int       ud;
typedef unsigned long long u8;
typedef   signed char      i1;
typedef          short     i2;
typedef          long      i4;
typedef          int       id;
typedef          long long i8;
typedef          float     f4;
typedef          double    f8;

#define usizeof(a) ((ud)sizeof(a))

#define LOOPS 25000000

#include <stdio.h>
#include <windows.h>

#ifndef bool
typedef signed char bool;
#endif

u8 CPU_rdtsc (void)
{
  ud tickl, tickh;
  __asm__ __volatile__("rdtsc":"=a"(tickl),"=d"(tickh));
  return ((u8)tickh << 32)|tickl;
}

volatile u8 bus_lock (volatile u8 * block, u8 value)
{
  __asm__ __volatile__( "xchgq %0,%1" : "=r" (value), "+m" (*block) : "0" (value) : "memory");

  return value;
}

void bus_unlock (volatile u8 * block, u8 value)
{
  __asm__ __volatile__( "movq %1,%0" : "=m" (*block) : "r" (value) : "memory");
}

void rfence (void)
{
  __asm__ __volatile__( "lfence" : : : "memory");
}

void rwfence (void)
{
  __asm__ __volatile__( "mfence" : : : "memory");
}

void wfence (void)
{
  __asm__ __volatile__( "sfence" : : : "memory");
}

volatile bool LOCK_spinlockPreTestIfFree (const volatile u8 *lockVariablePointer)
{
  return (bool)(*lockVariablePointer == 0ull);
}

volatile bool LOCK_spinlockFailed (volatile u8 *lockVariablePointer)
{
  return (bool)(bus_lock (lockVariablePointer, 1ull) != 0ull);
}

void LOCK_spinlockLeave (volatile u8 *lockVariablePointer)
{
  *lockVariablePointer = 0ull;
}

static volatile u8 lockVariable = 0ull,
                   lockCounter =  0ull;

static volatile i8 threadHold = 1;

static u8 tstr[4][32];    /* 32*8=256 bytes for each thread's parameters should result in them residing in different cache lines */

struct LOCKING_THREAD_STRUCTURE
{
  u8 numberOfFailures, numberOfPreTests;
  f8 clocksPerLock, failuresPerLock, preTestsPerLock;
  u8 threadId;
  HANDLE threadHandle;
  ud idx;
} *lts[4] = {(void *)tstr[0], (void *)tstr[1], (void *)tstr[2], (void *)tstr[3]};

DWORD WINAPI locking_thread (struct LOCKING_THREAD_STRUCTURE *ltsp)
{
  ud n = LOOPS;
  u8 clockCycles;

  SetThreadAffinityMask (ltsp->threadHandle, 1ull<<ltsp->idx);

  while (threadHold) {}

  clockCycles = CPU_rdtsc ();
  while (n)
  {
    Sleep (0);

#ifdef CLASSIC_BUS_LOCK
    while (LOCK_spinlockFailed (&lockVariable)) {++ltsp->numberOfFailures;}
#else
#ifdef WHILE_PRETEST
    while (1)
    {
      do
      {
        ++ltsp->numberOfPreTests;
      } while (!LOCK_spinlockPreTestIfFree (&lockVariable));

      if (!LOCK_spinlockFailed (&lockVariable)) break;
      ++ltsp->numberOfFailures;
    }
#else
    while (1)
    {
      ++ltsp->numberOfPreTests;
      if (LOCK_spinlockPreTestIfFree (&lockVariable))
      {
        if (!LOCK_spinlockFailed (&lockVariable)) break;
        ++ltsp->numberOfFailures;
      }
    }
#endif
#endif
    ++lockCounter;
    LOCK_spinlockLeave (&lockVariable);

#ifdef CLASSIC_BUS_LOCK
    while (LOCK_spinlockFailed (&lockVariable)) {++ltsp->numberOfFailures;}
#else
#ifdef WHILE_PRETEST
    while (1)
    {
      do
      {
        ++ltsp->numberOfPreTests;
      } while (!LOCK_spinlockPreTestIfFree (&lockVariable));

      if (!LOCK_spinlockFailed (&lockVariable)) break;
      ++ltsp->numberOfFailures;
    }
#else
    while (1)
    {
      ++ltsp->numberOfPreTests;
      if (LOCK_spinlockPreTestIfFree (&lockVariable))
      {
        if (!LOCK_spinlockFailed (&lockVariable)) break;
        ++ltsp->numberOfFailures;
      }
    }
#endif
#endif
    --lockCounter;
    LOCK_spinlockLeave (&lockVariable);

    n-=2;
  }
  clockCycles = CPU_rdtsc ()-clockCycles;

  ltsp->clocksPerLock =   (f8)clockCycles/           (f8)LOOPS;
  ltsp->failuresPerLock = (f8)ltsp->numberOfFailures/(f8)LOOPS;
  ltsp->preTestsPerLock = (f8)ltsp->numberOfPreTests/(f8)LOOPS;

//rwfence ();

  ltsp->idx = 4u;

  ExitThread (0);
  return 0;
}

int main (int argc, char *argv[])
{
  u8 processAffinityMask, systemAffinityMask;

  memset (tstr, 0u, usizeof(tstr));

  lts[0]->idx = 3;
  lts[1]->idx = 2;
  lts[2]->idx = 1;
  lts[3]->idx = 0;

  GetProcessAffinityMask (GetCurrentProcess(), &processAffinityMask, &systemAffinityMask);

  SetPriorityClass (GetCurrentProcess(), HIGH_PRIORITY_CLASS);
  SetThreadAffinityMask (GetCurrentThread (), 1ull);

  lts[0]->threadHandle = CreateThread (NULL, 65536u, (void *)locking_thread, (void *)lts[0], 0, (void *)&lts[0]->threadId);
#ifndef SINGLE_THREAD
  lts[1]->threadHandle = CreateThread (NULL, 65536u, (void *)locking_thread, (void *)lts[1], 0, (void *)&lts[1]->threadId);
  lts[2]->threadHandle = CreateThread (NULL, 65536u, (void *)locking_thread, (void *)lts[2], 0, (void *)&lts[2]->threadId);
  lts[3]->threadHandle = CreateThread (NULL, 65536u, (void *)locking_thread, (void *)lts[3], 0, (void *)&lts[3]->threadId);
#endif

  SetThreadAffinityMask (GetCurrentThread (), processAffinityMask);

  threadHold = 0;

#ifdef SINGLE_THREAD
  while (lts[0]->idx<4u) {Sleep (1);}
#else
  while (lts[0]->idx+lts[1]->idx+lts[2]->idx+lts[3]->idx<16u) {Sleep (1);}
#endif

  printf ("T0:%1.1f,%1.1f,%1.1f\n", lts[0]->clocksPerLock, lts[0]->failuresPerLock, lts[0]->preTestsPerLock);
  printf ("T1:%1.1f,%1.1f,%1.1f\n", lts[1]->clocksPerLock, lts[1]->failuresPerLock, lts[1]->preTestsPerLock);
  printf ("T2:%1.1f,%1.1f,%1.1f\n", lts[2]->clocksPerLock, lts[2]->failuresPerLock, lts[2]->preTestsPerLock);
  printf ("T3:%1.1f,%1.1f,%1.1f\n", lts[3]->clocksPerLock, lts[3]->failuresPerLock, lts[3]->preTestsPerLock);

  printf ("T*:%1.1f,%1.1f,%1.1f\n", (lts[0]->clocksPerLock+  lts[1]->clocksPerLock+  lts[2]->clocksPerLock+  lts[3]->clocksPerLock)/  4.,
                                    (lts[0]->failuresPerLock+lts[1]->failuresPerLock+lts[2]->failuresPerLock+lts[3]->failuresPerLock)/4.,
                                    (lts[0]->preTestsPerLock+lts[1]->preTestsPerLock+lts[2]->preTestsPerLock+lts[3]->preTestsPerLock)/4.);

  printf ("LC:%u\n", (ud)lockCounter);

  return 0;
}

The program was run on a DELL i5-4310U based computer with DDR3-800, 2 cores/2 HTs capable of 2.7GHz and a common L3 cache.

To begin with it appears the impact of WOW64 was negligible.

A single thread performing uncontended lock/unlocks was able to do so once every 110 cycles. Tuning the uncontended lock is useless: any code added to enhance the single XCHG instruction will only make it slower.

With four HTs bombarding the lock variable with lock attempts the situation changes radically. The time required to achieve a successful lock jumps to 994 cycles, of which a significant part can be attributed to the 2.2 failed lock attempts. To put it another way, in a high-contention situation on average 3.2 locks must be attempted to achieve one successful lock. Obviously the 110 cycles have not become 110*3.2 but closer to 110*9, so other mechanisms are at play here, just as with the tests on the older machine. Also, the average of 994 cycles encompasses a range between 716 and 1157 cycles.

The lock variants implementing pre-testing required about 95% of the cycles required by the simplest variant (XCHG). On average they would perform 17 CMPs to find it feasible to attempt 1.75 locks, of which 1 was successful. I recommend using pre-testing not only because it is faster: it also imposes less strain on the bus-locking mechanism (3.2-1.75 = 1.45 fewer lock attempts), even though it increases the complexity slightly.

Wikipedia has a good article on spinlocks; here is the x86 implementation:

http://en.wikipedia.org/wiki/Spinlock#Example_implementation

Notice that their implementation doesn't use the "lock" prefix, because it is redundant on x86 for the "xchg" instruction - it implicitly has lock semantics, as discussed in this Stack Overflow discussion:

On a multicore x86, is a LOCK necessary as a prefix to XCHG?

REP NOP is an alias for the PAUSE instruction; you can learn more about it here:

How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?

On the issue of memory barriers, here's everything you might want to know

Memory Barriers: a Hardware View for Software Hackers by Paul E. McKenney

http://irl.cs.ucla.edu/~yingdi/paperreading/whymb.2010.06.07c.pdf

Just asking:

Before you dig that deep into spinlocks and nearly-lockless data structures:

Have you - in your benchmarks and your application - made sure that the competing threads are guaranteed to run on different cores?

If not, you may end up with a program that works great on your development machine but fails hard in the field, because one thread has to act as both the locker and the unlocker of your spinlock.

To give you a figure: on Windows the standard time slice is 10 milliseconds. If you don't make sure that two physical threads are involved in locking/unlocking, you'll end up with around 500 locks/unlocks per second, and that result will be very underwhelming.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow