Question

I'm pretty new to lockless data structures, so as an exercise I wrote (what I hope functions as) a bounded lockless deque (no resizing yet; I just want to get the base cases working). I'd like some confirmation from people who know what they're doing as to whether I've got the right idea and/or how I might improve this.

class LocklessDeque
{
  public:

    LocklessDeque() : m_empty(false),
                      m_bottom(0),
                      m_top(0) {}


    ~LocklessDeque()
    {
      // Delete remaining tasks
      for( unsigned i = m_top; i < m_bottom; ++i )
        delete m_tasks[i];
    }


    void PushBottom(ITask* task)
    {
      m_tasks[m_bottom] = task;

      InterlockedIncrement(&m_bottom);
    }


    ITask* PopBottom()
    {
      if( m_bottom - m_top > 0 )
      {
        m_empty = false;

        InterlockedDecrement(&m_bottom);

        return m_tasks[m_bottom];
      }

      m_empty = true;

      return NULL;
    }


    ITask* PopTop()
    {
      if( m_bottom - m_top > 0 )
      {
        m_empty = false;

        InterlockedIncrement(&m_top);

        return m_tasks[m_top];
      }

      m_empty = true;

      return NULL;
    }


    bool IsEmpty() const
    {
      return m_empty;
    }

  private:

    ITask* m_tasks[16];

    bool m_empty;

    volatile unsigned m_bottom;
    volatile unsigned m_top;

};

Solution

Looking at this I would think this would be a problem:

void PushBottom(ITask* task)
{
  m_tasks[m_bottom] = task;
  InterlockedIncrement(&m_bottom);
}

If this is used in an actual multithreaded environment I would think you'd collide when setting m_tasks[m_bottom]. Think of what would happen if you have two threads trying to do this at the same time: both could read the same value of m_bottom and write their task into the same slot before either increment runs, so one task is silently lost - you couldn't be sure of which one actually set m_tasks[m_bottom].

Check out this article which is a reasonable discussion of a lock-free queue.

OTHER TIPS

Your use of the m_bottom and m_top members to index the array is not okay. You can use the return value of InterlockedXxxx() to get a safe index. You'll need to lose IsEmpty(); it can never be accurate in a multi-threading scenario. The same problem applies to the empty check in PopXxx. I don't see how you could make that work without a mutex.

The key to doing almost impossible stuff like this is to use InterlockedCompareExchange. (This is the name Win32 uses but any multithreaded-capable platform will have an InterlockedCompareExchange equivalent).

The idea is that you make a copy of the structure, which must be small enough to be read atomically (64 bits, or 128 bits on x86 if you can handle some unportability).

You make another copy with your proposed update: do your logic and update the copy, then you update the "real" structure using InterlockedCompareExchange. What InterlockedCompareExchange does is atomically check that the value is still the value you started with before your state update, and if it is, atomically replace it with the new value. Generally this is wrapped in an infinite loop that keeps trying until no other thread has changed the value in the middle. Here is roughly the pattern:

#include <windows.h>

union State
{
    struct
    {
        short a;
        short b;
    };
    LONG volatile rawState;   // 32 bits, small enough for InterlockedCompareExchange
} state;

void DoSomething()
{
    // Keep looping until nobody else changed it behind our back
    for (;;)
    {
        State origState;
        State newState;

        // It's important that you only read the state once per try
        origState.rawState = state.rawState;
        // This must copy origState, NOT read the state again
        newState.rawState = origState.rawState;

        // Now you can do something impossible to do atomically...
        // This example takes a lot of cycles, there is huge
        // opportunity for another thread to come in and change
        // it during this update
        if (newState.b == 3 || newState.b % 6 != 0)
        {
            newState.a++;
        }

        // Now we atomically update the state,
        // this ONLY changes state.rawState if it's still == origState.rawState
        // Either way, InterlockedCompareExchange returns the value the
        // destination held before the call
        if (InterlockedCompareExchange(&state.rawState, newState.rawState,
                origState.rawState) == origState.rawState)
            return;
    }
}

(Please forgive if the above code doesn't actually compile - I wrote it off the top of my head)

Great. Now you can make lockless algorithms easy. WRONG! The trouble is that you are severely limited on the amount of data that you can update atomically.

Some lockless algorithms use a technique where they "help" concurrent threads. For example, say you have a linked list that you want to be able to update from multiple threads. Other threads can "help" by performing updates to the "first" and "last" pointers if, while racing through, they see that they are at the node pointed to by "last" but the "next" pointer in that node is not null. In this example, upon noticing that the "last" pointer is stale, they swing the last pointer forward - but only if it still points to the current node - using an interlocked compare exchange.
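
As a rough, hedged sketch of that "helping" idea - the Node layout, the shared "last" pointer and the helper function below are hypothetical illustrations, not part of the code above:

#include <windows.h>

struct Node
{
    void* item;
    Node* volatile next;
};

Node* volatile last;   // hypothetical shared "last" pointer

void HelpAdvanceLast()
{
    Node* observedLast = last;
    if (observedLast == NULL)
        return;   // assume in practice the list always holds a dummy node

    Node* observedNext = observedLast->next;

    // If "last" is lagging behind (its node already has a successor), any
    // thread that notices can swing it forward - but only if it still points
    // where we saw it; otherwise someone else already helped.
    if (observedNext != NULL)
    {
        InterlockedCompareExchangePointer(
            reinterpret_cast<PVOID volatile*>(&last),
            observedNext, observedLast);
    }
}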

Don't fall into a trap where you "spin" or loop (like a spinlock). While there is value in spinning briefly because you expect the "other" thread to finish something, it may not - the "other" thread may have been context switched and may not be running anymore. You are just eating CPU time, burning electricity (killing a laptop battery perhaps) by spinning until a condition is true. The moment you begin to spin, you might as well chuck your lockless code and write it with locks. Locks are better than unbounded spinning.
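
If you do want a short, bounded spin before blocking, one Win32 facility worth a look (sketched below, not something the original code uses) is a critical section initialized with a spin count: it spins briefly on contention and then sleeps in the kernel instead of spinning forever.

#include <windows.h>

CRITICAL_SECTION g_queueLock;   // hypothetical lock protecting the deque

void InitQueueLock()
{
    // Spin up to ~4000 iterations on contention before falling back to a
    // kernel wait, rather than spinning unboundedly.
    InitializeCriticalSectionAndSpinCount(&g_queueLock, 4000);
}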

Just to go from hard to ridiculous, consider the mess you can get yourself into with other architectures. Things are generally pretty forgiving on x86/x64, but when you get into other "weakly ordered" architectures you get into territory where things happen that make no sense: memory updates won't happen in program order, so all your mental reasoning about what the other thread is doing goes out the window. (Even x86/x64 have a memory type called "write combining", often used when updating video memory but usable for ordinary memory buffers too, where you need fences.)

Those architectures require you to use "memory fence" operations to guarantee that all reads/writes/both before the fence will be globally visible (to other cores). A write fence guarantees that any writes before the fence will be globally visible before any writes after the fence. A read fence guarantees that no reads after the fence will be speculatively executed before the fence. A read/write fence (a.k.a. full fence or memory fence) makes both guarantees. Fences are very expensive. (Some use the term "barrier" instead of "fence".)
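
As a small, hedged illustration of the publish/consume pattern a fence protects (the variable names are hypothetical; MemoryBarrier is the Win32 full fence):

#include <windows.h>

int g_payload;                // data handed from one thread to another
LONG volatile g_ready = 0;    // flag announcing that the data is valid

void Producer()
{
    g_payload = 42;
    MemoryBarrier();    // full fence: the payload must be globally visible
                        // before the flag that announces it
    g_ready = 1;
}

void Consumer()
{
    if (g_ready)
    {
        MemoryBarrier();    // keep the payload read from being speculated
                            // ahead of the flag check
        int value = g_payload;
        // ... use value ...
        (void)value;
    }
}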

My suggestion is to implement it first with locks/condition variables. If you have any trouble getting that working perfectly, it's hopeless to attempt a lockless implementation. And always measure, measure, measure. You'll probably find the performance of the implementation using locks is perfectly fine - without the uncertainty of some flaky lockless implementation with a nasty hang bug that will only show up when you're doing a demo to an important customer. Perhaps you can fix the problem by redefining the original problem into something more easily solved, perhaps by restructuring the work so bigger items (or batches of items) are going into the collection, which reduces the pressure on the whole thing.
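
For reference, here is a minimal sketch of what that lock-based starting point might look like. It mirrors the original interface but is not the asker's code, and it uses std::mutex with std::deque instead of a fixed array:

#include <deque>
#include <mutex>

class ITask;

class LockedDeque
{
  public:
    void PushBottom(ITask* task)
    {
        std::lock_guard<std::mutex> guard(m_lock);
        m_tasks.push_back(task);
    }

    ITask* PopBottom()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        if (m_tasks.empty())
            return NULL;
        ITask* task = m_tasks.back();
        m_tasks.pop_back();
        return task;
    }

    ITask* PopTop()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        if (m_tasks.empty())
            return NULL;
        ITask* task = m_tasks.front();
        m_tasks.pop_front();
        return task;
    }

  private:
    std::mutex m_lock;
    std::deque<ITask*> m_tasks;
};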

Writing lockless concurrent algorithms is very difficult (as you've seen written 1000x elsewhere I'm sure). It is often not worth the effort either.

Addressing the problem pointed out by Aaron, I'd do something like:

void PushBottom(ITask *task) {
    // InterlockedIncrement returns the incremented value, so subtract one
    // to get the slot we just claimed.
    long pos = InterlockedIncrement(&m_bottom) - 1;
    m_tasks[pos] = task;
}

Likewise, to pop:

ITask* PopTop() {
  long pos = InterlockedIncrement(&m_top) - 1;  // index of the item we claimed
  if (pos >= m_bottom) // This is still subject to a race condition.
      return NULL;
  return m_tasks[pos];
}

I'd eliminate both m_empty and IsEmpty() from the design completely. The result returned by IsEmpty is subject to a race condition, meaning by the time you look at that result, it may well be stale (i.e. what it tells you about the queue may be wrong by the time you act on it). Likewise, m_empty provides nothing but a record of information that's already available without it - a recipe for producing stale data. Using m_empty doesn't guarantee the code can't work, but it does increase the chances of bugs considerably, IMO.

I'm guessing it's due to the interim nature of the code, but right now you also have some serious problems with the array bounds. You aren't doing anything to force your array indexes to wrap around, so as soon as you try to push the 17th task onto the queue, you're going to have a major problem.
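
A minimal sketch of one way to bound the index, assuming the 16-slot array from the question (the mask works because 16 is a power of two; note that wrapping alone does nothing to stop a push from overwriting an entry that hasn't been popped yet):

static const unsigned kCapacity = 16;   // must stay a power of two for the mask to work

void PushBottom(ITask* task)
{
    long pos = InterlockedIncrement(&m_bottom) - 1;
    m_tasks[pos & (kCapacity - 1)] = task;   // wrap the index into [0, 16)
}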

Edit: I should point out that the race condition noted in the comment is quite serious -- and it's not the only one either. Although somewhat better than the original, this should not be mistaken for code that will work correctly.

I'd say that writing correct lock-free code is considerably more difficult than writing correct code that uses locks. I don't know of anybody who has done so without a solid understanding of code that does use locking. Based on the original code, I'd say it would be much better to start by writing and understanding code for a queue that does use locks, and only once you've used that to gain a much better understanding of the issues involved should you attempt lock-free code.

Licensed under: CC-BY-SA with attribution