Problem

I've been doing some experimentation lately on using large numbers of random numbers to generate "normal distribution" bell curves.

The approach is simple:

  • Create an array of integers and zero it out. (I'm using 2001 integers.)
  • Repeatedly compute an index into this array and increment the entry at that index, as follows:
    • Start an array index at the center value (1000).
    • Loop either 999 or 1000 times. On each iteration:
      • Generate a random value of +1 or -1 and add it to the array index.
    • At the end of the loop, increment the value at the computed array index.

Since the +1 and -1 values occur about equally often, the final index from the inner loop tends to stay close to the center value. Index values much larger or smaller than the starting value are increasingly unusual.
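For reference, the random-walk approach above can be sketched in plain C. The names `plot_points`, `CENTER`, `SLOT_COUNT`, and `PIN_COUNT` are my own, and `rand_r()` stands in for `arc4random_uniform()` just to keep the sketch portable and self-contained:

```c
#include <stdlib.h>

#define CENTER     1000               /* middle slot of the 2001-entry array */
#define SLOT_COUNT (2 * CENTER + 1)
#define PIN_COUNT  1000               /* +/-1 steps per plotted point */

/* Plot `pointCount` random-walk endpoints into `histogram`.
   rand_r() stands in for arc4random_uniform() to keep the sketch portable. */
static void plot_points(unsigned long histogram[SLOT_COUNT],
                        long pointCount, unsigned *seed)
{
    for (long p = 0; p < pointCount; p++) {
        int index = CENTER;                       /* start at the center */
        for (int step = 0; step < PIN_COUNT; step++)
            index += (rand_r(seed) % 2) ? 1 : -1; /* random +1 or -1 */
        histogram[index]++;                       /* record the endpoint */
    }
}
```

After enough points, `histogram` peaks near `CENTER` and falls off on both sides, which is the bell curve the question describes.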

After a large number of repetitions, the values in the array take on the shape of a normal distribution bell curve. However, the high-quality random function arc4random_uniform() that I'm using is fairly slow, and it takes a LOT of iterations to generate a smooth curve.

I wanted to plot 1,000,000,000 (one billion) points. Running on the main thread, that takes about 16 hours.

I decided to rewrite the calculation code to use dispatch_async, and run it on my 8-core Mac Pro.

I ended up using dispatch_group_async() to submit 8 blocks, with a dispatch_group_notify() to notify the program when all the blocks have finished processing.

For simplicity on the first pass, all 8 blocks write to the same array of NSUInteger values. There is a small chance of a race condition on a read/modify write to one of the array entries, but in that case, that would simply cause one value to be lost. I was planning on adding a lock to the array increment later (or perhaps even creating separate arrays in each block, and then summing them after.)

Anyway, I refactored the code to use dispatch_group_async() and calculate 1/8 of the total values in each block, and set my code off to run. To my utter befuddlement, the concurrent code, while it maxes out all of the cores on my Mac, runs MUCH slower than the single-threaded code.

When run on a single thread, I get about 17,800 points plotted per second. When run using dispatch_group_async, the performance drops to more like 665 points/second, or about 1/26 as fast. I've varied the number of blocks I submit - 2, 4, or 8, it doesn't matter. Performance is awful. I've also tried simply submitting all 8 blocks using dispatch_async with no dispatch_group. That doesn't matter either.

There's currently no blocking/locking going on: All the blocks run at full speed. I am utterly mystified as to why the concurrent code runs slower.

The code is a little muddled now because I refactored it to work either single-threaded or concurrently so I could test.

Here's the code that runs the calculations:

randCount = 2;
#define K_USE_ASYNC 1

#if K_USE_ASYNC
    dispatch_queue_t highQ = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0);
    //dispatch_group_async

    dispatch_group_t aGroup = dispatch_group_create();
    int totalJobs = 8;
    for (int i = 0; i<totalJobs; i++)
    {
      dispatch_group_async(aGroup,
                           highQ,
                           ^{
                             [self calculateNArrayPoints: KNumberOfEntries /totalJobs
                                          usingRandCount: randCount];
                           });
    }


    dispatch_group_notify(aGroup,
                          dispatch_get_main_queue(),
                          allTasksDoneBlock
                          );
#else
    [self calculateNArrayPoints: KNumberOfEntries
                 usingRandCount: randCount];
    allTasksDoneBlock();
#endif

And the common calculation method, which is used by both the single-threaded and the concurrent version:

+ (void) calculateNArrayPoints: (NSInteger) pointCount usingRandCount: (int) randCount
{
  int entry;
  int random_index;

  for (entry =0; entry<pointCount; entry++)
  {
    static int processed = 0;
    if (entry != 0 && entry%100000 == 0)
    {
      [self addTotTotalProcessed: processed];
      processed = 0;
    }

    //Start the walk at 0; the center offset (1000) is added at the end
    int value = 0;

    //For each entry, add +/- 1 to the value 1000 times.
    int limit  = KPinCount;
    if (randCount==2)
      if (arc4random_uniform(2) !=0)
        limit--;
    for (random_index = 0; random_index<limit; random_index++)
    {
      int random_value = arc4random_uniform(randCount);
      /*
       if 0, value--
       if 1, no change
       if 2, value++
       */
      if (random_value == 0)
        value--;
      else if (random_value == randCount-1)
        value++;
    }
    value += 1000;
    _bellCurveData[value] += 1;
    //printf("\n\nfinal value = %d\n", value);
    processed++;
  }
}

This is a quick-and-dirty learning project. It runs on both Mac and iOS, so it uses a shared utilities class. The utilities class is nothing but class methods; no instance of the utilities class is ever created. It has an embarrassing number of global variables. If I end up doing anything useful with the code, I'll refactor it to create a utilities singleton, and convert all the globals to instance variables on the singleton.

For now, it works, and the hideous use of globals doesn't affect the outcome, so I'm leaving it.

The code that uses the "processed" variable is just a way of figuring out how many points have been calculated when run in concurrent mode. I added that code after I discovered the horrid performance of the concurrent version, so I'm confident it isn't a cause of the slowdown.

I'm stumped here. I've written a fair amount of concurrent code, and this task is an "embarrassingly parallel" problem, so there's no reason it shouldn't run at full tilt on all available cores.

Does anybody else see anything that might cause this, or have any other insights to offer?


Solution

arc4random uses a critical section while modifying its state. The critical section is super-fast in the non-contended case (when changing from unlocked to locked), but in the contended case (when trying to lock a mutex that's already locked) it has to call into the operating system and put the thread to sleep, which hurts performance badly.

u_int32_t
arc4random()
{
    u_int32_t rnd;

    THREAD_LOCK();
    arc4_check_init();
    arc4_check_stir();
    rnd = arc4_getword(&rs);
    THREAD_UNLOCK();

    return (rnd);
}

where THREAD_LOCK() is defined as

#define THREAD_LOCK()                       \
    do {                            \
        if (__isthreaded)               \
            _pthread_mutex_lock(&arc4random_mtx);   \
    } while (0)

Source: Arc4 random number generator for OpenBSD

To make it faster, you could create an Arc4Random class that wraps the static arc4_* functions from arc4random.c. You'd then have a random-number generator that is no longer thread-safe, but you could create one generator for each thread.
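As a sketch of that idea, the generator's state can live in a caller-owned struct rather than a locked global, so each thread owns an unlocked instance. The names `Arc4`, `arc4_init`, and `arc4_getbyte` are illustrative; the keystream logic is the standard RC4 algorithm that arc4random is based on:

```c
#include <stdint.h>
#include <stddef.h>

/* A minimal RC4-style generator with its state in a caller-owned struct
   instead of a locked global. It is deliberately NOT thread-safe: the idea
   is to give each thread its own instance, so no lock is ever needed. */
typedef struct {
    uint8_t s[256];
    uint8_t i, j;
} Arc4;

static void arc4_init(Arc4 *a, const uint8_t *key, size_t keylen)
{
    for (int n = 0; n < 256; n++)
        a->s[n] = (uint8_t)n;
    a->i = a->j = 0;
    uint8_t j = 0;
    for (int n = 0; n < 256; n++) {               /* standard RC4 key schedule */
        j = (uint8_t)(j + a->s[n] + key[n % keylen]);
        uint8_t tmp = a->s[n]; a->s[n] = a->s[j]; a->s[j] = tmp;
    }
}

static uint8_t arc4_getbyte(Arc4 *a)
{
    a->i = (uint8_t)(a->i + 1);
    a->j = (uint8_t)(a->j + a->s[a->i]);
    uint8_t tmp = a->s[a->i]; a->s[a->i] = a->s[a->j]; a->s[a->j] = tmp;
    return a->s[(uint8_t)(a->s[a->i] + a->s[a->j])];
}
```

Note that unlike the real arc4random, this sketch does no automatic seeding or periodic re-stirring from /dev/random; you'd have to add that yourself if you care about cryptographic quality.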

Other tips

This is speculation, so I can't confirm it one way or another without actually profiling the code as it stands.

That said, arc4random locks on each call, going by Apple's collection of source code. Because you're calling arc4random_uniform from multiple threads, each call takes that lock at least once, if not multiple times. So my best guess here is simply that each task is waiting on the other tasks' calls to arc4random_uniform (and _uniform may in turn contend with itself, since a single _uniform call can make multiple calls to arc4random).

The easiest way to fix this might be to simply pull the existing arc4random.c source code and modify it to either be wrapped in a class while removing synchronization from it (as I suggested in chat, or as Michael suggested) or to make use of thread-local storage (this fixes the thread-safety issue but may be just as slow — haven't tried it myself, so mountain of salt). Bear in mind that if you do go either route, you will need an alternative to accessing /dev/random on iOS. I'd recommend using SecRandomCopyBytes in that case, as it should yield the same or just as good results as reading from /dev/random yourself.
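A minimal sketch of the thread-local-storage route, using C11 `_Thread_local` with `rand_r()` in place of the full arc4 state (the name `random_bit` and the lazy seeding scheme are my own):

```c
#include <stdlib.h>
#include <stdint.h>

/* One generator state per thread via thread-local storage. Each thread
   lazily seeds its own state (the address of a thread-local variable is
   unique per thread), so rand_r() never touches shared state and no lock
   is ever contended. */
static _Thread_local unsigned tls_seed = 0;

static int random_bit(void)
{
    if (tls_seed == 0)                              /* lazy per-thread init */
        tls_seed = (unsigned)(uintptr_t)&tls_seed | 1u;
    return rand_r(&tls_seed) % 2;
}
```

The same pattern would work with the `Arc4` struct approach: declare the struct `_Thread_local` and initialize it on first use in each thread.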

So, while I'm pretty sure it's down to arc4random, I can't say for sure without profiling because there may be other things causing performance issues even before arc4random starts doing its thing.

Ok, thanks to Michael and Noel for your thoughtful responses.

Indeed it seems that arc4random() and arc4random_uniform() use a variant of a spin_lock, and performance is horrible in multi-threaded use.

It makes sense that a spin lock is a really bad choice in a case where there are a lot of collisions, because a spin lock makes the waiting thread busy-wait until the lock is released, tying up that core.

The ideal solution would be to create my own version of arc4random that maintains its own state array in instance variables and is not thread-safe. I would then refactor my app to create a separate instance of the random generator for each thread.

However, this is a side-project for my own research. That's more effort than I'm prepared to expend if I'm not getting paid.

As an experiment, I replaced the code with rand(), and the single-threaded case is quite a bit faster, since rand() is a simpler, faster algorithm. The random numbers aren't as good, though. From what I've read, rand() has problems with cyclic patterns in the lower bits, so instead of using the typical rand()%2, I tested a high-order bit with rand() & 0x4000 instead.

However, performance still decreased dramatically when I tried to use rand() in my multi-threaded code. It must use locking internally as well.

I then switched to rand_r(), which takes a pointer to a caller-owned seed value. Since all of its state lives in that seed, I figured it probably does not use locking.

Bingo. I now get 415,674 points/second running on my 8-core Mac Pro.
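A minimal sketch of that winning setup: one `rand_r()` seed and one private histogram per worker, merged after the joins. The names (`Worker`, `worker_run`, `run_concurrent`) are illustrative, and plain pthreads stand in for the GCD blocks in the original code:

```c
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

#define CENTER      1000
#define SLOT_COUNT  (2 * CENTER + 1)
#define PIN_COUNT   1000
#define NUM_WORKERS 8

/* Each worker owns a rand_r() seed and a private histogram, so the hot loop
   shares nothing and takes no locks; results are summed after the joins. */
typedef struct {
    unsigned seed;
    long points;
    unsigned long hist[SLOT_COUNT];
} Worker;

static void *worker_run(void *arg)
{
    Worker *w = arg;
    for (long p = 0; p < w->points; p++) {
        int index = CENTER;
        for (int step = 0; step < PIN_COUNT; step++)
            index += (rand_r(&w->seed) % 2) ? 1 : -1;
        w->hist[index]++;                 /* private: no race, no lock */
    }
    return NULL;
}

/* Run `totalPoints` points across NUM_WORKERS threads; fills `hist` with
   the summed per-worker histograms. */
static void run_concurrent(long totalPoints, unsigned long hist[SLOT_COUNT])
{
    pthread_t tids[NUM_WORKERS];
    Worker *workers = calloc(NUM_WORKERS, sizeof *workers);
    for (int t = 0; t < NUM_WORKERS; t++) {
        workers[t].seed = (unsigned)(t + 1);      /* distinct seed per worker */
        workers[t].points = totalPoints / NUM_WORKERS;
        pthread_create(&tids[t], NULL, worker_run, &workers[t]);
    }
    for (int t = 0; t < NUM_WORKERS; t++)
        pthread_join(tids[t], NULL);
    memset(hist, 0, SLOT_COUNT * sizeof *hist);   /* merge private histograms */
    for (int t = 0; t < NUM_WORKERS; t++)
        for (int i = 0; i < SLOT_COUNT; i++)
            hist[i] += workers[t].hist[i];
    free(workers);
}
```

Because nothing in the hot loop is shared, this also removes the read-modify-write race on the histogram that the original shared-array version had.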

License: CC-BY-SA with attribution
Not affiliated with StackOverflow