Implementation of a work stealing queue in C/C++?

https://stackoverflow.com/questions/2101789

21-09-2019
|

Question

I'm looking for a proper implementation of a work stealing queue in C/CPP. I've looked around Google but haven't found anything useful.

Perhaps someone is familiar with a good open-source implementation? (I prefer not to implement the pseudo-code taken from the original academic papers).

Solution

No free lunch.

Please take a look the original work stealing paper. This paper is hard to understand. I know that paper contains theoretical proof rather than pseudo code. However, there is simply no such much more simple version than TBB. If any, it won't give optimal performance. Work stealing itself incurs some amount of overhead, so optimizations and tricks are quite important. Especially, dequeues are must be thread-safe. Implementing highly scalable and low-overhead synchronizations are challenging.

I'm really wondering why you need it. I think that proper implementation means something like TBB and Cilk. Again, work stealing is hard to implement.

OTHER TIPS

Take a look at Intel's Threading Building Blocks.

http://www.threadingbuildingblocks.org/

To implement "work stealing" isn't hard in theory. You need a set of queues containing tasks that do work by doing a combination of computing and generating other tasks to do more work. And you need atomic access to the queues to place newly generated tasks into those queues. Finally, you need a procedure that each task calls at the end, to find more work for the thread that executed the task; that procedure needs to look in work queues to find work.

Most such work-stealing systems make the assumption that there are a small number of threads (backed up typically by real processor cores), and that there is a exactly one work queue per thread. Then you first try to steal work from your own queue, and if it is empty, try to steal from others. What gets tricky is knowing which queues to look in; scanning them serially for work is pretty expensive and can create a huge amount of contention between threads looking for work.

So far this is all pretty generic stuff with one two major exceptions: 1) switching contexts (e.g, setting processor context registers such as a "stack") cannot be stated in pure C or C++. You can resolve this by agreeing to write part of your package in target-platform specific machine code. 2) Atomic access to the queues for a multiprocessor cannot be done purely in C or C++ (ignoring Dekker's algorithm), and so you'll need to code those using assembly language synchronization primitives like the X86 LOCK XCH or Compare and Swap. Now, the code involved in updating the queuse once you have safe access isn't very complex, and you could easily write that in a few lines of C.

However, I think you will find is that attempting to code such a package in C and C++ with mixed assembler is still rather inefficient and you'll eventually end up coding the entire thing in assembler anyway. All's that left are C/C++ compatible entry points :-}

I did this for our PARLANSE parallel programming language, which offers the idea of an arbitrarily large number of parallel computations live and interacting (synchonizing) at any instant. It is implemented behind the scenes on an X86 exactly with one thread per CPU, and the implementation is entirely in assembler. The work-stealing code is probably 1000 lines total, and its tricky code because you want it to be extremely fast in the non-contention case.

The real fly in the ointment for C and C++ is, when you create a task representing work, how much stack space do you assign? Serial C/C++ programs avoid this question by simply overallocating huge amounts (e.g, 10Mb) of one linear stack, and nobody cares much about how much of that stack space is wasted. But if you can create thousands of tasks and have them all live at a particular instant, you can't reasonably allocate 10Mb to each one. So now you either need to determine statically how much stack space a task will need (Turing-hard), or you'll need to allocate stack chunks (e.g., per function call), which widely available C/C++ compilers don't do (e.g, the one you are likely using). THe last way out is to throttle task creation to limit it to a few hundred at any instant, and multiplex a few-hundred really huge stacks among the tasks that are live. You can't do the last if the tasks can interlock/suspend state, because you'll run into your threshold. So you can only do this if the tasks only do computation. That seems like a pretty severe constraint.

For PARLANSE, we built a compiler that allocates activation records on the heap for each function call.

There exist a tool to simply doing it in an very elegant way. It is a really effective way to parrallelize your program in a very short time.

Cilk project

HPC Challenge Award

Our Cilk entry for the HPC Challenge Class 2 award won the 2006 award for ``Best Combination of Elegance and Performance''. The award was made at SC'06 in Tampa on November 14 2006.

If you're looking for a standalone workstealing queue class implementation in C++ built on pthread or boost::thread, good luck, to my knowledge there isn't one.

However, as others have said Cilk, TBB and Microsoft's PPL all have workstealing implementations under the hood.

The question is do you want to use a workstealing queue or implement one? If you just want to use one then the choices above are good starting points simply scheduling a 'task' in any of them will suffice.

As BlueRaja said the task_group & structured_task_group in PPL will do this, also note that these classes are available in the latest version of Intel's TBB as well. The parallel loops (parallel_for, parallel_for_each) are also implemented with workstealing.

If you must look at source rather than use an implementation, TBB is OpenSource and Microsoft ships the sources for its CRT, so you can go spelunking.

You can also look on Joe Duffy's blog for a C# implementation (but it's C# and the memory model is different).

-Rick

The structured_task_group class of the PPL uses a work stealing queue for its implementation. If you need a WSQ for threading, I would recommend that.
If you are actually looking for the source, I don't know if the code is given in ppl.h or if there is a precompiled object; I will have to check when I get home tonight.

The closest implementation of this work stealing algorithm I've found is something called Wool by Karl-Filip Faxén. src / report / comparison

OpenMP may very well support work-stealing although its called recursive parallelism

OpenMP forum post

The OpenMP specification defines tasking constructs (which can be nested, so are very suitable for recursive parallelism) but does not specify the details of how they how they are implemented. OpenMP implementations, including gcc, typically use some form of work stealing for tasks, though the exact algorithm (and the resulting performance) may vary!

See #pragma omp task and #pragma omp taskwait

Update

Chapter 9 of the book C++ Concurrency in Action describes how to implement "work stealing for pool threads". I haven't read/implemented it myself but it doesn't look too difficult.

This open source library https://github.com/cpp-taskflow/cpp-taskflow supports work stealing thread pools since Dec 2018.

Take a look at the WorkStealingQueue class which implements the work stealing queue as described in the paper "Dynamic Circular Work-stealing Deque," SPAA, 2015.

Will breaking your work tasks into smaller units eliminate the need for stealing work in the first place?

I have ported this C project to C++.

The original Steal may experience a dirty read when the array is expanded. I tried to fix the bug, but eventually gave in because I didn't actually need a dynamically growing stack. Instead of trying to allocate space, the Push method simply returns false. The caller can then perform a spin-wait, i.e. while(!stack->Push(value)){}.

#pragma once
#include <atomic>

  // A lock-free stack.
  // Push = single producer
  // Pop = single consumer (same thread as push)
  // Steal = multiple consumer

  // All methods, including Push, may fail. Re-issue the request
  // if that occurs (spinwait).

  template<class T, size_t capacity = 131072>
  class WorkStealingStack {

  public:
    inline WorkStealingStack() {
      _top = 1;
      _bottom = 1;
    }

    WorkStealingStack(const WorkStealingStack&) = delete;

    inline ~WorkStealingStack()
    {

    }

    // Single producer
    inline bool Push(const T& item) {
      auto oldtop = _top.load(std::memory_order_relaxed);
      auto oldbottom = _bottom.load(std::memory_order_relaxed);
      auto numtasks = oldbottom - oldtop;

      if (
        oldbottom > oldtop && // size_t is unsigned, validate the result is positive
        numtasks >= capacity - 1) {
        // The caller can decide what to do, they will probably spinwait.
        return false;
      }

      _values[oldbottom % capacity].store(item, std::memory_order_relaxed);
      _bottom.fetch_add(1, std::memory_order_release);
      return true;
    }

    // Single consumer
    inline bool Pop(T& result) {

      size_t oldtop, oldbottom, newtop, newbottom, ot;

      oldbottom = _bottom.fetch_sub(1, std::memory_order_release);
      ot = oldtop = _top.load(std::memory_order_acquire);
      newtop = oldtop + 1;
      newbottom = oldbottom - 1;

      // Bottom has wrapped around.
      if (oldbottom < oldtop) {
        _bottom.store(oldtop, std::memory_order_relaxed);
        return false;
      }

      // The queue is empty.
      if (oldbottom == oldtop) {
        _bottom.fetch_add(1, std::memory_order_release);
        return false;
      }

      // Make sure that we are not contending for the item.
      if (newbottom == oldtop) {
        auto ret = _values[newbottom % capacity].load(std::memory_order_relaxed);
        if (!_top.compare_exchange_strong(oldtop, newtop, std::memory_order_acquire)) {
          _bottom.fetch_add(1, std::memory_order_release);
          return false;
        }
        else {
          result = ret;
          _bottom.store(newtop, std::memory_order_release);
          return true;
        }
      }

      // It's uncontended.
      result = _values[newbottom % capacity].load(std::memory_order_acquire);
      return true;
    }

    // Multiple consumer.
    inline bool Steal(T& result) {
      size_t oldtop, newtop, oldbottom;

      oldtop = _top.load(std::memory_order_acquire);
      oldbottom = _bottom.load(std::memory_order_relaxed);
      newtop = oldtop + 1;

      if (oldbottom <= oldtop)
        return false;

      // Make sure that we are not contending for the item.
      if (!_top.compare_exchange_strong(oldtop, newtop, std::memory_order_acquire)) {
        return false;
      }

      result = _values[oldtop % capacity].load(std::memory_order_relaxed);
      return true;
    }

  private:

    // Circular array
    std::atomic<T> _values[capacity];
    std::atomic<size_t> _top; // queue
    std::atomic<size_t> _bottom; // stack
  };

Full Gist (including unit tests). I have only run the tests on a strong architecture (x86/64), so as far as weak architectures go your mileage may vary if you try to use this on e.g. Neon/PPC.

I don't think JobSwarm uses work stealing, but it's a first step. I'm not aware of other open source libraries for this purpose.

dunno if this would be of any help to you, but have a look at this article on AMD's developer network, its simple, but should give you something useful

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow