Why is a single thread faster than multiple threads even though they essentially have the same overhead?

StackOverflow https://stackoverflow.com/questions/15673524

Question

I am running 64-bit Windows 7 on an 8-core processor. I ran the following:

    #include "stdafx.h"
    #include <iostream>
    #include <Windows.h>
    #include <process.h>
    #include <ctime>

    using namespace std;

    int count = 0;
    int t = time(NULL);

    //poop() loops incrementing count until it is 300 million.
    void poop(void* params) {
        while(count < 300000000) {
            count++;
        }


        cout<< time(NULL) - t <<" \n";
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
        //_beginthread(poop, 0, NULL);      
        //_beginthread(poop, 0, NULL);
        poop(NULL);

        cout<<"done"<<endl;

        while(1);

        return 0;
    }

I compared the result to when I uncommented the _beginthread calls. It turns out the single-threaded version accomplishes this the fastest! Actually, adding more threads makes the process take even longer. Making the count 300 million made the process take 8+ seconds, which I figured was good enough to rule out the _beginthread function calls and other minor overhead.

I did a little research, and the usual explanation for a multithreaded process being slower is overhead. But in this case, whether I run multiple threads or a single one, the variable count (which lives in the data segment because it's a preallocated variable, AFAIK) is accessed the same number of times. So the overhead (if it is an overhead issue) is not coming from the fact that it costs more to access a global variable than a local one.

Looking at my Task Manager, the process with a single thread is using 13% CPU (about 1/8 of the cores), and adding threads increases the CPU usage in increments of roughly 1/8. So in terms of CPU power, assuming Task Manager is depicting this accurately, adding threads uses more CPU. Which further confuses me: how am I using more CPU overall, on separate cores, yet taking longer to accomplish the task?

TLDR: WHY DOES THIS HAPPEN

Was it helpful?

Solution

Your code is inherently wrong.

count++ is a three-step operation that reads the value, increments it, then stores it back into the variable.
If two threads run count++ at once on the same variable, one of them will overwrite the other one's changes.
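
Spelled out, the increment looks roughly like this (a sketch of the effective steps, not the exact code the compiler emits):

    int tmp = count;   // 1. read the shared value
    tmp = tmp + 1;     // 2. increment the local copy
    count = tmp;       // 3. write it back; any write another thread made in between is lost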

Therefore, the multi-threaded version will end up doing extra work as each thread clobbers the other threads' progress.

If you make count a local variable, the timing should look much more normal.

Alternatively, you could use an interlocked increment, which is thread-safe, but has extra overhead to synchronize across threads.
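
For example, a minimal sketch using the Win32 interlocked API (this assumes count stays a shared global and is declared as a volatile LONG, which is what InterlockedIncrement expects):

    #include <Windows.h>
    #include <iostream>
    #include <ctime>

    volatile LONG count = 0;
    int t = time(NULL);

    void poop(void* params) {
        while(count < 300000000) {
            InterlockedIncrement(&count);   // atomic read-modify-write, no lost updates
        }
        std::cout << time(NULL) - t << " \n";
    }

Each increment is now atomic, but every thread still contends for the same cache line, so this will typically still be slower than giving each thread its own local counter.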

Other tips

As some of the commenters on your original question have pointed out, you have both a correctness and a performance issue. Firstly, all your threads are accessing count concurrently. This means there is no guarantee that the threads will actually all count to 300 million. You can solve this correctness bug by declaring count within your poop function:

    void poop(void* params) {
        int count = 0;
        while(count < 300000000) {
            count++;
        }
        cout<< time(NULL) - t <<" \n";
    }

Note that this is not a problem for t because it is only read, not written, by the threads. However, it is a problem with cout, as you're also writing to that from multiple threads.
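
If you do want each thread to print, one option is to serialize the output, for example with a Win32 critical section (a sketch; g_printLock and print_elapsed are illustrative names, not part of the original code):

    #include <Windows.h>
    #include <iostream>

    CRITICAL_SECTION g_printLock;   // call InitializeCriticalSection(&g_printLock) once at startup

    void print_elapsed(int seconds) {
        EnterCriticalSection(&g_printLock);
        std::cout << seconds << " \n";   // only one thread writes to cout at a time
        LeaveCriticalSection(&g_printLock);
    }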

In addition, as pointed out in the comments, all your threads are accessing a single memory location. This means that whenever a thread updates count, the cache line that holds it must be invalidated on the other cores and reloaded, which is very inefficient memory access. This typically comes up when threads access consecutive elements of an array rather than a single shared variable (which is a bad idea anyway, see above). One solution is to pad the array so that each entry is an exact multiple of the L1 cache line size; this is obviously somewhat specific to your target processor. Another option is to restructure the algorithm so that either each thread processes a large block of consecutive elements, or each thread accesses elements in such a way that threads never touch adjacent locations.
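
As a sketch of the padding idea (assuming a 64-byte cache line, which is common but processor-specific):

    // One counter per thread, each padded out to its own cache line so that
    // updates by one thread do not invalidate the line used by another.
    struct PaddedCounter {
        int value;
        char padding[64 - sizeof(int)];
    };

    PaddedCounter counters[8];   // e.g. one slot per core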

As you're using Windows, you might want to consider using a higher level of abstraction for your code rather than the raw Win32 threading functions. The Parallel Patterns Library fits the bill here (as does Intel's Threading Building Blocks).

    #include <ppl.h>   // needed for concurrency::parallel_invoke

    concurrency::parallel_invoke(
        [=] { poop(nullptr); },
        [=] { poop(nullptr); }
    );

This allows the PPL to schedule your tasks on a thread pool, rather than your application having to explicitly create threads.

You might also consider that for really small tasks the overhead of starting additional threads may outweigh the gains of running in parallel.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow