x86 equivalent for LWARX and STWCX

https://stackoverflow.com/questions/1147904

18-09-2019
|

Question

I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/web sites/forums for lock/wait-free programing).

Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.

int* IncrementAndRetrieve(int **ptr)
{
  int val;
  int *pval;
  do
  {
    // fetch the pointer to the value
    pval = *ptr;

    // if its NULL, then just return NULL, the smart pointer
    // will then become NULL as well
    if(pval == NULL)
      return NULL;

    // Grab the reference count
    val = lwarx(pval);

    // make sure the pointer we grabbed the value from
    // is still the same one referred to by  'ptr'
    if(pval != *ptr)
      continue;

    // Increment the reference count via 'stwcx' if any other threads
    // have done anything that could potentially break then it should
    // fail and try again
  } while(!stwcx(pval, val + 1));
  return pval;
}

I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).

Thanks

Solution

As Michael mentioned, what you're probably looking for is the cmpxchg instruction.

It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).

If you need the stronger semantics on the x86 architecture, you can approximate it by using the x86s double-width compare-and-swap (DWCAS) instruction cmpxchg8b, or cmpxchg16b under x86_64. This allows you to atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is one of the two words contains the value of interest, and the other one contains an always incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter to wrap between attempts is so low that it's a reasonable substitute for most purposes.

OTHER TIPS

x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).

You're probably looking for the cmpxchg family of instructions.

You'll need to precede these with a lock instruction to get equivalent behaviour.

Have a look here for a quick overview of what's available.

You'll likely end up with something similar to this:

mov ecx,dword ptr [esp+4]
mov edx,dword ptr [esp+8]
mov eax,dword ptr [esp+12]
lock cmpxchg dword ptr [ecx],edx
ret 12

You should read this paper...

Edit

In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.

if you are on 64 bits and limit yourself to say 1tb of heap, you can pack the counter into the 24 unused top bits. if you have word aligned pointers the bottom 5 bits are also available.

int* IncrementAndRetrieve(int **ptr)
{
  int val;
  int *unpacked;
  do
  {   
    val = *ptr;
    unpacked = unpack(val);

    if(unpacked == NULL)
      return NULL;
    // pointer is on the bottom
  } while(!cas(unpacked, val, val + 1));
  return unpacked;
}

Don't know if LWARX and STWCX invalidate the whole cache line, CAS and DCAS do. Meaning that unless you are willing to throw away a lot of memory (64 bytes for each independent "lockable" pointer) you won't see much improvement if you are really pushing your software into stress. The best results I've seen so far were when people consciously casrificed 64b, planed their structures around it (packing stuff that won't be subject of contention), kept everything alligned on 64b boundaries, and used explicit read and write data barriers. Cache line invalidation can cost approx 20 to 100 cycles, making it a bigger real perf issue then just lock avoidance.

Also, you'd have to plan different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" - one request "leaks" and then releases all it's memory bulk at the end) or datailed allocation management so that one structure under contention never receives memory realesed by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive but it's either that or paying the price for GC.

What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (Win32 function; assembly: XADD).

The reason that your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and stwcx without invalidating the stwcx.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow