How can doing tasks in multiple threads be 100 times slower than doing sequentially on the main thread?

StackOverflow https://stackoverflow.com/questions/23507721

Question

I have this other question of mine where I have asked about converting a code from sequential to parallel processing using Grand Central Dispatch.

I will copy the question text here to make things easy...


I have an array of NSNumbers that has to pass through 20 tests. If any test fails, the array is invalid; if all tests pass, the array is valid. I want evaluation to stop as soon as the first failure happens: if a failure happens on the 3rd test, the remaining tests are skipped.

Every individual test returns YES when it fails and NO when it passes.

I am trying to convert the serial-processing code I have into parallel processing with Grand Central Dispatch, but I cannot wrap my head around it.

This is what I have.

First the definition of the tests to be done. This array is used to run the tests.

#define TESTS  @[         \
    @"averageNotOK:",     \
    @"numbersOverRange:", \
    @"numbersUnderRange:",\
    @"numbersForbidden:", \
    // ... etc etc
    @"numbersNotOnCurve:"]


- (BOOL)numbersPassedAllTests:(NSArray *)numbers {

  NSInteger count = [TESTS count];

  for (NSInteger i = 0; i < count; i++) {

    NSString *aMethodName = TESTS[i];
    SEL selector = NSSelectorFromString(aMethodName);

    BOOL failed = NO;

    NSMethodSignature *signature = [[self class] instanceMethodSignatureForSelector:selector];

    NSInvocation *invocation = [NSInvocation invocationWithMethodSignature:signature];
    [invocation setSelector:selector];
    [invocation setTarget:self];
    [invocation setArgument:&numbers atIndex:2];
    [invocation invoke];

    [invocation getReturnValue:&failed];

    if (failed) {
      return NO;
    }
  }
  return YES;
}

This works perfectly, but it performs the tests sequentially.

After working on the code with the help of another user, I got this version using Grand Central Dispatch:

- (BOOL) numbersPassedAllTests:(NSArray *)numbers {

  volatile __block int32_t hasFailed = 0;

  NSInteger count = [TESTS count];

  __block NSArray *numb = [[NSArray alloc] initWithArray:numbers];

  dispatch_apply(
                 count,
                 dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0),
                 ^(size_t index)
                 {
                   // do no computation if somebody else already failed
                   if (hasFailed) {
                     return;
                   }


                   SEL selector = NSSelectorFromString(TESTS[index]);

                   BOOL failed = NO;

                   NSMethodSignature *signature = [[self class] instanceMethodSignatureForSelector:selector];

                   NSInvocation *invocation = [NSInvocation invocationWithMethodSignature:signature];
                   [invocation setSelector:selector];
                   [invocation setTarget:self];
                   [invocation setArgument:&numb atIndex:2];
                   [invocation invoke];

                   [invocation getReturnValue:&failed];

                   if(failed)
                     OSAtomicIncrement32(&hasFailed);
                 });

  return !hasFailed;
}

Activity Monitor shows what appears to be the cores being used more intensively, but this code is at least 100 times slower than the old sequential version!

How can that be?


Solution

If the methods you're calling are simple, the overhead of scheduling all of these blocks can offset any advantage gained by concurrency. As the Performing Loop Iterations Concurrently section of the Concurrency Programming Guide says:

You should make sure that your task code does a reasonable amount of work through each iteration. As with any block or function you dispatch to a queue, there is overhead to scheduling that code for execution. If each iteration of your loop performs only a small amount of work, the overhead of scheduling the code may outweigh the performance benefits you might achieve from dispatching it to a queue. If you find this is true during your testing, you can use striding to increase the amount of work performed during each loop iteration. With striding, you group together multiple iterations of your original loop into a single block and reduce the iteration count proportionately. For example, if you perform 100 iterations initially but decide to use a stride of 4, you now perform 4 loop iterations from each block and your iteration count is 25. For an example of how to implement striding, see “Improving on Loop Code.”

That link to Improving on Loop Code walks through a sample implementation of striding, whereby you balance the number of threads with the amount of work done by each. It will take some experimentation to find the right balance with your methods, so play around with different striding values until you achieve the best performance.

In my experiments with a CPU-bound process, I found a huge gain when using two threads, but it diminished after that point. Your results may vary depending on what the methods you're calling actually do.


By the way, what do these methods you're calling actually do? If you're doing anything that requires the main thread (e.g. UI updates), that will also skew the results. For the sake of comparison, I'd suggest you take your serial example and dispatch it to a background queue (as a single task) and see what sort of performance you get that way. That lets you separate main-vs-background-queue effects from the too-many-blocks overhead issue discussed above.

Other tips

Parallel computing only makes sense if each node has enough work to do. Otherwise, the extra overhead of setting up and managing the parallel nodes costs more time than the problem itself.

Example of bad parallelization:

void function() {
  for (int i = 0; i < 1000000; ++i) {
    for (int j = 0; j < 1000000; ++j) {
      ParallelAction { // Turns the following code into a task to be run concurrently.
        print(i + ", " + j)
      }
    }
  }
}

Problem: every print() statement has to be turned into a task; for each one, a worker node has to initialize, acquire the task, finish it, and go find a new one.

Essentially, you've got 1 000 000 * 1 000 000 threads waiting for a node to work on them.

How to make the above better:

void function() {
  for (int i = 0; i < 1000000; ++i) {
    ParallelAction { // Turns the following code into a task to be run concurrently.
      for (int j = 0; j < 1000000; ++j) {
        print(i + ", " + j)
      }
    }
  }
}

This way, every node can start up, do a sizeable amount of work (print 1 000 000 things), finish up, and find a new job.

http://en.wikipedia.org/wiki/Granularity

The above link discusses granularity: how finely a problem is broken up into parallel units of work.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow