Problem

I've been doing some experimentation lately on using large numbers of random numbers to generate "normal distribution" bell curves.

The approach is simple:

  • Create an array of integers and zero it out. (I'm using 2001 integers.)
  • Repeatedly compute an index into this array and increment the entry at that index, as follows:
    • Start an array index at the center value (1000).
    • Loop either 999 or 1000 times. On each iteration:
      • Generate a random value of +1 or -1 and add it to the array index.
    • At the end of the loop, increment the value at the computed array index.

Since the +1 and -1 values occur about equally often, the final index from the inner loop tends to stay close to the center value. Index values much larger or smaller than the starting value are increasingly unusual.
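For reference, the random-walk approach above can be sketched in plain C. The names `plot_points`, `CENTER`, `SLOT_COUNT`, and `PIN_COUNT` are my own, and `rand_r()` stands in for `arc4random_uniform()` just to keep the sketch portable and self-contained:

```c
#include <stdlib.h>

#define CENTER     1000               /* middle slot of the 2001-entry array */
#define SLOT_COUNT (2 * CENTER + 1)
#define PIN_COUNT  1000               /* +/-1 steps per plotted point */

/* Plot `pointCount` random-walk endpoints into `histogram`.
   rand_r() stands in for arc4random_uniform() to keep the sketch portable. */
static void plot_points(unsigned long histogram[SLOT_COUNT],
                        long pointCount, unsigned *seed)
{
    for (long p = 0; p < pointCount; p++) {
        int index = CENTER;                       /* start at the center */
        for (int step = 0; step < PIN_COUNT; step++)
            index += (rand_r(seed) % 2) ? 1 : -1; /* random +1 or -1 */
        histogram[index]++;                       /* record the endpoint */
    }
}
```

After enough points, `histogram` peaks near `CENTER` and falls off on both sides, which is the bell curve the question describes.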

After a large number of repetitions, the values in the array take on the shape of a normal distribution bell curve. However, the high-quality random function arc4random_uniform() that I'm using is fairly slow, and it takes a LOT of iterations to generate a smooth curve.

I wanted to plot 1,000,000,000 (one billion) points. Running on the main thread, that takes about 16 hours.

I decided to rewrite the calculation code to use dispatch_async, and run it on my 8-core Mac Pro.

I ended up using dispatch_group_async() to submit 8 blocks, with a dispatch_group_notify() to notify the program when all the blocks have finished processing.

For simplicity on the first pass, all 8 blocks write to the same array of NSUInteger values. There is a small chance of a race condition on a read/modify write to one of the array entries, but in that case, that would simply cause one value to be lost. I was planning on adding a lock to the array increment later (or perhaps even creating separate arrays in each block, and then summing them after.)

Anyway, I refactored the code to use dispatch_group_async() and calculate 1/8 of the total values in each block, and set my code off to run. To my utter befuddlement, the concurrent code, while it maxes out all of the cores on my Mac, runs MUCH slower than the single-threaded code.

When run on a single thread, I get about 17,800 points plotted per second. When run using dispatch_group_async, the performance drops to more like 665 points/second, or about 1/26 as fast. I've varied the number of blocks I submit - 2, 4, or 8, it doesn't matter. Performance is awful. I've also tried simply submitting all 8 blocks using dispatch_async with no dispatch_group. That doesn't matter either.

There's currently no blocking/locking going on: All the blocks run at full speed. I am utterly mystified as to why the concurrent code runs slower.

The code is a little muddled now because I refactored it to work either single-threaded or concurrently so I could test.

Here's the code that runs the calculations:

randCount = 2;
#define K_USE_ASYNC 1

#if K_USE_ASYNC
    dispatch_queue_t highQ = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0);
    //dispatch_group_async

    dispatch_group_t aGroup = dispatch_group_create();
    int totalJobs = 8;
    for (int i = 0; i<totalJobs; i++)
    {
      dispatch_group_async(aGroup,
                           highQ,
                           ^{
                             [self calculateNArrayPoints: KNumberOfEntries /totalJobs
                                          usingRandCount: randCount];
                           });
    }


    dispatch_group_notify(aGroup,
                          dispatch_get_main_queue(),
                          allTasksDoneBlock
                          );
#else
    [self calculateNArrayPoints: KNumberOfEntries
                 usingRandCount: randCount];
    allTasksDoneBlock();
#endif

And the common calculation method, which is used by both the single-threaded and the concurrent version:

+ (void) calculateNArrayPoints: (NSInteger) pointCount usingRandCount: (int) randCount
{
  int entry;
  int random_index;

  for (entry =0; entry<pointCount; entry++)
  {
    static int processed = 0;
    if (entry != 0 && entry%100000 == 0)
    {
      [self addTotTotalProcessed: processed];
      processed = 0;
    }

    //Start the walk at 0; the center offset (1000) is added at the end
    int value = 0;

    //For each entry, add +/- 1 to the value 1000 times.
    int limit  = KPinCount;
    if (randCount==2)
      if (arc4random_uniform(2) !=0)
        limit--;
    for (random_index = 0; random_index<limit; random_index++)
    {
      int random_value = arc4random_uniform(randCount);
      /*
       if 0, value--
       if 1, no change
       if 2, value++
       */
      if (random_value == 0)
        value--;
      else if (random_value == randCount-1)
        value++;
    }
    value += 1000;
    _bellCurveData[value] += 1;
    //printf("\n\nfinal value = %d\n", value);
    processed++;
  }
}

This is a quick-and-dirty learning project. It runs on both Mac and iOS, so it uses a shared utilities class. The utilities class is nothing but class methods; no instance of the utilities class is ever created. It has an embarrassing number of global variables. If I end up doing anything useful with the code, I'll refactor it to create a utilities singleton, and convert all the globals to instance variables on the singleton.

For now, it works, and the hideous use of globals doesn't affect the outcome, so I'm leaving it.

The code that uses the "processed" variable is just a way of figuring out how many points have been calculated when run in concurrent mode. I added that code after I discovered the horrid performance of the concurrent version, so I'm confident it isn't a cause of the slowdown.

I'm stumped here. I've written a fair amount of concurrent code, and this task is an "embarrassingly parallel" problem, so there's no reason it shouldn't run at full tilt on all available cores.

Does anybody else see anything that might cause this, or have any other insights to offer?


Solution

arc4random uses a critical section while modifying its state. The critical section is super-fast in the non-contended case (when changing from unlocked to locked), but in the contended case (when trying to lock a mutex that's already locked) it has to call into the operating system and put the thread to sleep, which hurts performance badly.

u_int32_t
arc4random()
{
    u_int32_t rnd;

    THREAD_LOCK();
    arc4_check_init();
    arc4_check_stir();
    rnd = arc4_getword(&rs);
    THREAD_UNLOCK();

    return (rnd);
}

where THREAD_LOCK() is defined as

#define THREAD_LOCK()                       \
    do {                            \
        if (__isthreaded)               \
            _pthread_mutex_lock(&arc4random_mtx);   \
    } while (0)

Source: Arc4 random number generator for OpenBSD

To make it faster, you could create an Arc4Random class that wraps the static arc4_* functions from arc4random.c. You'd then have a random-number generator that is no longer thread-safe, but you could create one generator for each thread.
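As a sketch of that idea, the generator's state can live in a caller-owned struct rather than a locked global, so each thread owns an unlocked instance. The names `Arc4`, `arc4_init`, and `arc4_getbyte` are illustrative; the keystream logic is the standard RC4 algorithm that arc4random is based on:

```c
#include <stdint.h>
#include <stddef.h>

/* A minimal RC4-style generator with its state in a caller-owned struct
   instead of a locked global. It is deliberately NOT thread-safe: the idea
   is to give each thread its own instance, so no lock is ever needed. */
typedef struct {
    uint8_t s[256];
    uint8_t i, j;
} Arc4;

static void arc4_init(Arc4 *a, const uint8_t *key, size_t keylen)
{
    for (int n = 0; n < 256; n++)
        a->s[n] = (uint8_t)n;
    a->i = a->j = 0;
    uint8_t j = 0;
    for (int n = 0; n < 256; n++) {               /* standard RC4 key schedule */
        j = (uint8_t)(j + a->s[n] + key[n % keylen]);
        uint8_t tmp = a->s[n]; a->s[n] = a->s[j]; a->s[j] = tmp;
    }
}

static uint8_t arc4_getbyte(Arc4 *a)
{
    a->i = (uint8_t)(a->i + 1);
    a->j = (uint8_t)(a->j + a->s[a->i]);
    uint8_t tmp = a->s[a->i]; a->s[a->i] = a->s[a->j]; a->s[a->j] = tmp;
    return a->s[(uint8_t)(a->s[a->i] + a->s[a->j])];
}
```

Note that unlike the real arc4random, this sketch does no automatic seeding or periodic re-stirring from /dev/random; you'd have to add that yourself if you care about cryptographic quality.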

Other tips

This is speculation, so I can't confirm it one way or another without actually profiling the code as it stands.

That said, arc4random locks on each call, going by Apple's collection of source code. Because you're calling arc4random_uniform from multiple threads, each call takes that lock at least once, if not multiple times. So my best guess here is simply that each task is waiting on the other tasks' calls to arc4random_uniform (and _uniform may in turn contend with itself, since a single _uniform call can make multiple calls to arc4random).

The easiest way to fix this might be to simply pull the existing arc4random.c source code and modify it to either be wrapped in a class while removing synchronization from it (as I suggested in chat, or as Michael suggested) or to make use of thread-local storage (this fixes the thread-safety issue but may be just as slow — haven't tried it myself, so mountain of salt). Bear in mind that if you do go either route, you will need an alternative to accessing /dev/random on iOS. I'd recommend using SecRandomCopyBytes in that case, as it should yield the same or just as good results as reading from /dev/random yourself.
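A minimal sketch of the thread-local-storage route, using C11 `_Thread_local` with `rand_r()` in place of the full arc4 state (the name `random_bit` and the lazy seeding scheme are my own):

```c
#include <stdlib.h>
#include <stdint.h>

/* One generator state per thread via thread-local storage. Each thread
   lazily seeds its own state (the address of a thread-local variable is
   unique per thread), so rand_r() never touches shared state and no lock
   is ever contended. */
static _Thread_local unsigned tls_seed = 0;

static int random_bit(void)
{
    if (tls_seed == 0)                              /* lazy per-thread init */
        tls_seed = (unsigned)(uintptr_t)&tls_seed | 1u;
    return rand_r(&tls_seed) % 2;
}
```

The same pattern would work with the `Arc4` struct approach: declare the struct `_Thread_local` and initialize it on first use in each thread.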

So, while I'm pretty sure it's down to arc4random, I can't say for sure without profiling because there may be other things causing performance issues even before arc4random starts doing its thing.

Ok, thanks to Michael and Noel for your thoughtful responses.

Indeed it seems that arc4random() and arc4random_uniform() use a variant of a spin_lock, and performance is horrible in multi-threaded use.

It makes sense that a spin lock is a really bad choice in a case where there are a lot of collisions, because a spin lock makes the waiting thread busy-wait until the lock is released, tying up that core.

The ideal solution would be to create my own version of arc4random that maintains its own state array in instance variables and is not thread-safe. I would then refactor my app to create a separate instance of the random generator for each thread.

However, this is a side-project for my own research. That's more effort than I'm prepared to expend if I'm not getting paid.

As an experiment, I replaced the code with rand(), and the single-threaded case is quite a bit faster, since rand() is a simpler, faster algorithm. The random numbers aren't as good, though. From what I've read, rand() has problems with cyclic patterns in the lower bits, so instead of using the typical rand()%2, I tested a high-order bit with rand() & 0x4000 instead.

However, performance still decreased dramatically when I tried to use rand() in my multi-threaded code. It must use locking internally as well.

I then switched to rand_r(), which takes a pointer to a caller-owned seed value. Since all of its state lives in that seed, I figured it probably does not use locking.

Bingo. I now get 415,674 points/second running on my 8-core Mac Pro.
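A minimal sketch of that winning setup: one `rand_r()` seed and one private histogram per worker, merged after the joins. The names (`Worker`, `worker_run`, `run_concurrent`) are illustrative, and plain pthreads stand in for the GCD blocks in the original code:

```c
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

#define CENTER      1000
#define SLOT_COUNT  (2 * CENTER + 1)
#define PIN_COUNT   1000
#define NUM_WORKERS 8

/* Each worker owns a rand_r() seed and a private histogram, so the hot loop
   shares nothing and takes no locks; results are summed after the joins. */
typedef struct {
    unsigned seed;
    long points;
    unsigned long hist[SLOT_COUNT];
} Worker;

static void *worker_run(void *arg)
{
    Worker *w = arg;
    for (long p = 0; p < w->points; p++) {
        int index = CENTER;
        for (int step = 0; step < PIN_COUNT; step++)
            index += (rand_r(&w->seed) % 2) ? 1 : -1;
        w->hist[index]++;                 /* private: no race, no lock */
    }
    return NULL;
}

/* Run `totalPoints` points across NUM_WORKERS threads; fills `hist` with
   the summed per-worker histograms. */
static void run_concurrent(long totalPoints, unsigned long hist[SLOT_COUNT])
{
    pthread_t tids[NUM_WORKERS];
    Worker *workers = calloc(NUM_WORKERS, sizeof *workers);
    for (int t = 0; t < NUM_WORKERS; t++) {
        workers[t].seed = (unsigned)(t + 1);      /* distinct seed per worker */
        workers[t].points = totalPoints / NUM_WORKERS;
        pthread_create(&tids[t], NULL, worker_run, &workers[t]);
    }
    for (int t = 0; t < NUM_WORKERS; t++)
        pthread_join(tids[t], NULL);
    memset(hist, 0, SLOT_COUNT * sizeof *hist);   /* merge private histograms */
    for (int t = 0; t < NUM_WORKERS; t++)
        for (int i = 0; i < SLOT_COUNT; i++)
            hist[i] += workers[t].hist[i];
    free(workers);
}
```

Because nothing in the hot loop is shared, this also removes the read-modify-write race on the histogram that the original shared-array version had.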

License: CC-BY-SA with attribution
Not affiliated with StackOverflow