How can doing tasks in multiple threads be 100 times slower than doing sequentially on the main thread?

StackOverflow https://stackoverflow.com/questions/23507721

Question

I have this other question of mine where I have asked about converting a code from sequential to parallel processing using Grand Central Dispatch.

I will copy the question text here to make things easy...


I have an array of NSNumbers that has to pass through 20 tests. If any test fails, the array is invalid; if all tests pass, the array is valid. I want evaluation to stop as soon as the first failure happens: if a failure happens on the 3rd test, the remaining tests are skipped.

Every individual test returns YES when it fails and NO when it passes.

I am trying to convert the serial-processing code I have into parallel processing with Grand Central Dispatch, but I cannot wrap my head around it.

This is what I have.

First the definition of the tests to be done. This array is used to run the tests.

#define TESTS  @[         \
    @"averageNotOK:",     \
    @"numbersOverRange:", \
    @"numbersUnderRange:",\
    @"numbersForbidden:", \
    // ... etc etc
    @"numbersNotOnCurve:"]


- (BOOL)numbersPassedAllTests:(NSArray *)numbers {

  NSInteger count = [TESTS count];

  for (NSInteger i = 0; i < count; i++) {

    NSString *aMethodName = TESTS[i];
    SEL selector = NSSelectorFromString(aMethodName);

    BOOL failed = NO;

    NSMethodSignature *signature = [[self class] instanceMethodSignatureForSelector:selector];

    NSInvocation *invocation = [NSInvocation invocationWithMethodSignature:signature];
    [invocation setSelector:selector];
    [invocation setTarget:self];
    [invocation setArgument:&numbers atIndex:2];
    [invocation invoke];

    [invocation getReturnValue:&failed];

    if (failed) {
      return NO;
    }
  }
  return YES;
}

This works perfectly, but it performs the tests sequentially.

After working on the code with the help of another user, I got this version using Grand Central Dispatch:

- (BOOL) numbersPassedAllTests:(NSArray *)numbers {

  volatile __block int32_t hasFailed = 0;

  NSInteger count = [TESTS count];

  __block NSArray *numb = [[NSArray alloc] initWithArray:numbers];

  dispatch_apply(
                 count,
                 dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0),
                 ^(size_t index)
                 {
                   // do no computation if somebody else already failed
                   if (hasFailed) {
                     return;
                   }


                   SEL selector = NSSelectorFromString(TESTS[index]);

                   BOOL failed = NO;

                   NSMethodSignature *signature = [[self class] instanceMethodSignatureForSelector:selector];

                   NSInvocation *invocation = [NSInvocation invocationWithMethodSignature:signature];
                   [invocation setSelector:selector];
                   [invocation setTarget:self];
                   [invocation setArgument:&numb atIndex:2];
                   [invocation invoke];

                   [invocation getReturnValue:&failed];

                   if(failed)
                     OSAtomicIncrement32(&hasFailed);
                 });

  return !hasFailed;
}

Activity Monitor shows what appears to be the cores being used more intensively, but this code is at least 100 times slower than the old sequential version!

How can that be?


Solution

If the methods you're calling are simple, the overhead of scheduling all of these blocks can offset any advantage gained by concurrency. As the Performing Loop Iterations Concurrently section of the Concurrency Programming Guide says:

You should make sure that your task code does a reasonable amount of work through each iteration. As with any block or function you dispatch to a queue, there is overhead to scheduling that code for execution. If each iteration of your loop performs only a small amount of work, the overhead of scheduling the code may outweigh the performance benefits you might achieve from dispatching it to a queue. If you find this is true during your testing, you can use striding to increase the amount of work performed during each loop iteration. With striding, you group together multiple iterations of your original loop into a single block and reduce the iteration count proportionately. For example, if you perform 100 iterations initially but decide to use a stride of 4, you now perform 4 loop iterations from each block and your iteration count is 25. For an example of how to implement striding, see “Improving on Loop Code.”

That link to Improving on Loop Code walks through a sample implementation of striding, whereby you balance the number of threads with the amount of work done by each. It will take some experimentation to find the right balance with your methods, so play around with different striding values until you achieve the best performance.

In my experiments with a CPU-bound process, I found a huge gain when using two threads, but it diminished after that point. Your results may vary depending on what the methods you're calling actually do.


By the way, what do these methods you're calling actually do? If you're doing anything that requires the main thread (e.g. UI updates), that will also skew the results. For the sake of comparison, I'd suggest you take your serial example and dispatch it to a background queue (as a single task) and see what sort of performance you get that way. That lets you separate main-vs-background-queue effects from the too-many-blocks overhead issue discussed above.

Other tips

Parallel computing only makes sense if each node has enough work to do. Otherwise, the extra overhead of setting up and managing the parallel nodes costs more time than the problem itself.

Example of bad parallelization:

void function() {
  for (int i = 0; i < 1000000; ++i) {
    for (int j = 0; j < 1000000; ++j) {
      ParallelAction { // Turns the following code into a task to be run concurrently.
        print(i + ", " + j)
      }
    }
  }
}

Problem: every print() statement has to be turned into a task; for each one, a worker node has to initialize, acquire the task, finish it, and go find a new one.

Essentially, you've got 1 000 000 * 1 000 000 threads waiting for a node to work on them.

How to make the above better:

void function() {
  for (int i = 0; i < 1000000; ++i) {
    ParallelAction { // Turns the following code into a task to be run concurrently.
      for (int j = 0; j < 1000000; ++j) {
        print(i + ", " + j)
      }
    }
  }
}

This way, every node can start up, do a sizeable amount of work (print 1 000 000 things), finish up, and find a new job.

http://en.wikipedia.org/wiki/Granularity

The above link discusses granularity: how finely a problem is broken up into parallel units of work.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow