Question

My CPU is a Core i3 330M with 2 cores and 4 threads. When I run cat /proc/cpuinfo in my terminal, it looks as if I have 4 CPUs. When I call the OpenMP function omp_get_num_procs(), I also get 4.
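
For reference, a minimal version of that check looks like this (omp_get_num_procs() reports the logical processors visible to the runtime; omp_get_max_threads() is shown as well for comparison):

#include <cstdio>
#include <omp.h>

int main()
{
    // Logical processors the runtime can see (4 on a 2-core / 4-thread i3 330M).
    std::printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());
    // Size of the next parallel team (usually the same, unless OMP_NUM_THREADS is set).
    std::printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    return 0;
}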

Now, I have a standard C++ vector class, that is, a fixed-size double array class that does not use expression templates. I have carefully parallelized all the methods of my class, and I get the "expected" speedup.

The question is: can I predict the speedup in such a simple case? For instance, if I add two vectors without parallelized for-loops, I get some time (using the shell's time command). Now, if I use OpenMP, should the time be divided by 2 or by 4, according to the number of cores/threads? I emphasize that I am only asking about this particular simple problem, where there is no interdependence in the data and everything is linear (vector addition).

Here is some code:

Vector Vector::operator+(const Vector& rhs) const
{
    assert(m_size == rhs.m_size);
    Vector result(m_size);
    // Each thread adds one contiguous chunk of the arrays (static schedule).
    #pragma omp parallel for schedule(static)
    for (unsigned int i = 0; i < m_size; i++)
        result.m_data[i] = m_data[i] + rhs.m_data[i];

    return result;
}

I have already read this post: OpenMP thread mapping to physical cores.

I hope somebody can tell me more about how OpenMP gets the work done in this simple case. I should say that I am a beginner in parallel computing.
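
For example, my mental model is that with schedule(static) and no chunk size, OpenMP hands each of the 4 threads one contiguous block of roughly m_size / 4 iterations. A toy sketch (separate from my class, just to observe the assignment) would be:

#include <cstdio>
#include <omp.h>

int main()
{
    const int n = 16;
    // With schedule(static) and no chunk size, each thread receives one
    // contiguous block of roughly n / num_threads iterations.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        std::printf("iteration %2d handled by thread %d\n", i, omp_get_thread_num());
    return 0;
}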

Thanks!


Solution

EDIT: Now that some code has been added.

In that particular example, there is very little computation and lots of memory access. So the performance will depend heavily on:

  • The size of the vector.
  • How you are timing it (do you have an outer loop for timing purposes? see the sketch after this list).
  • Whether the data is already in cache.

For larger vector sizes, you will likely find that performance is limited by your memory bandwidth, in which case parallelism is not going to help much. For smaller sizes, the overhead of threading will dominate. If you're getting the "expected" speedup, you're probably somewhere in between, where the size is large enough to amortize the threading overhead but not yet large enough to be bandwidth-bound.
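
To make the second point in the list above concrete, here is a rough sketch of the kind of timing harness I mean: a warm-up pass so the first cold-cache run is not measured, and an outer repetition loop so per-call overhead averages out. The vector size and repetition count are arbitrary placeholders that you would sweep over.

#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
    const long long n = 1LL << 22;      // about 4M doubles per vector (placeholder size)
    const int repetitions = 50;

    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);

    // Warm-up pass so the timed loop does not include page faults / cold caches.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    double t0 = omp_get_wtime();
    for (int r = 0; r < repetitions; r++) {
        #pragma omp parallel for schedule(static)
        for (long long i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }
    double elapsed = (omp_get_wtime() - t0) / repetitions;

    std::printf("average time per addition: %g s (c[0] = %f)\n", elapsed, c[0]);
    return 0;
}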

I refuse to give hard numbers because, in general, "guessing" performance, especially in multi-threaded applications, is a lost cause unless you have prior testing knowledge or intimate knowledge of both the program and the system it's running on.

Just as a simple example taken from my answer here: How to get 100% CPU usage from a C program

On a Core i7 920 @ 3.5 GHz (4 cores, 8 threads):

If I run with 4 threads, the result is:

This machine calculated all 78498 prime numbers under 1000000 in 39.3498 seconds

If I run with 4 threads and explicitly (using Task Manager) pin the threads to 4 distinct physical cores, the result is:

This machine calculated all 78498 prime numbers under 1000000 in 30.4429 seconds

So this shows how unpredictable performance is even for a very simple, embarrassingly parallel application. Applications involving heavy memory usage and synchronization get a lot uglier...
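
For what it's worth, a stripped-down OpenMP version of that kind of prime-counting test could look like the sketch below. This is not the code from the linked answer, just a minimal compute-bound stand-in; the dynamic schedule is there because large candidates cost more than small ones.

#include <cstdio>
#include <omp.h>

// Deliberately naive trial division, so the loop is compute-bound rather than memory-bound.
static bool is_prime(long n)
{
    if (n < 2) return false;
    for (long d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

int main()
{
    const long limit = 1000000;
    long count = 0;

    double t0 = omp_get_wtime();
    // reduction(+:count) gives each thread a private counter and sums them at the end.
    #pragma omp parallel for schedule(dynamic, 1000) reduction(+:count)
    for (long n = 2; n < limit; n++)
        if (is_prime(n)) count++;
    double elapsed = omp_get_wtime() - t0;

    std::printf("found %ld primes under %ld in %.4f seconds\n", count, limit, elapsed);
    return 0;
}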

OTHER TIPS

To add to Mystical's answer: your problem is purely memory-bandwidth bound. Have a look at the STREAM benchmark. Run it on your computer in the single- and multi-threaded cases and look at the Triad results - this is essentially your case (almost: Triad also scales one input by a constant, but its memory access pattern of two input streams and one output stream matches your vector addition). Calculate how much data you move around and you will know exactly what performance to expect.
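
As a back-of-the-envelope illustration of "calculate how much data you move around": c[i] = a[i] + b[i] over N doubles streams roughly 3 * N * 8 bytes (two reads plus one write; with write-allocate caches the written lines are read first, which pushes it closer to 4 * N * 8). Dividing by the measured Triad bandwidth gives a lower bound on the runtime. A sketch, with the bandwidth figure as a placeholder to replace by your own STREAM result:

#include <cstdio>

int main()
{
    const double n              = 4.0e6;   // elements per vector (assumption)
    const double bytes_per_elem = 8.0;     // sizeof(double)

    // Placeholder: substitute the Triad bandwidth that STREAM reports on *your* machine.
    const double bandwidth_gb_s = 8.0;

    // Two load streams plus one store stream; add a fourth if write-allocate traffic counts.
    const double bytes_moved    = 3.0 * n * bytes_per_elem;
    const double best_case_time = bytes_moved / (bandwidth_gb_s * 1.0e9);

    std::printf("data moved : %.1f MB\n", bytes_moved / 1.0e6);
    std::printf("lower bound: %.4f ms per addition\n", best_case_time * 1.0e3);
    return 0;
}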

Does multi-threading help with this problem? Yes. It is rare that a single CPU core can saturate the entire memory bandwidth of the system. Modern computers balance the available memory bandwidth against the number of cores available. In my experience, you need around half of the cores to saturate the memory bandwidth with a simple memory copy operation. It might take a few more if you do some calculations along the way.
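
A quick way to see this on your own machine is to sweep the thread count over a plain copy kernel and watch where the measured bandwidth levels off. A rough sketch (the buffer size is an arbitrary assumption, and write-allocate traffic is ignored):

#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
    const long long n = 1LL << 25;                   // 32M doubles, 256 MB per buffer
    std::vector<double> src(n, 1.0), dst(n, 0.0);

    for (int threads = 1; threads <= omp_get_num_procs(); threads++) {
        double t0 = omp_get_wtime();
        #pragma omp parallel for schedule(static) num_threads(threads)
        for (long long i = 0; i < n; i++)
            dst[i] = src[i];
        double elapsed = omp_get_wtime() - t0;

        // One read stream plus one write stream.
        double gbytes = 2.0 * n * sizeof(double) / 1.0e9;
        std::printf("%d thread(s): %.2f GB/s (dst[0] = %f)\n", threads, gbytes / elapsed, dst[0]);
    }
    return 0;
}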

Note that on NUMA systems you will need to bind the threads to CPU cores and use local memory allocation to get optimal results. On such systems every CPU has its own local memory, to which access is fastest. You can still access the entire system memory as on a usual SMP, but that incurs a communication cost: the CPUs have to exchange data explicitly. Binding threads to CPUs and using local allocation is extremely important; failing to do so kills scalability. Check out libnuma if you want to do this on Linux.
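
Binding itself is done outside the code (for example with OMP_PROC_BIND=true and OMP_PLACES=cores, or by launching under numactl); the local-allocation part is commonly handled with the first-touch idiom sketched below. This is not something your single-socket i3 needs, it only matters on multi-socket NUMA machines.

#include <cstdio>
#include <omp.h>

int main()
{
    const long long n = 1LL << 24;

    // Plain new[] (no value-initialization), so the pages are not touched by the
    // allocating thread; with Linux's default first-touch policy each page lands
    // on the NUMA node of the thread that writes it first.
    double* a = new double[n];
    double* b = new double[n];
    double* c = new double[n];

    // Initialize with the same schedule(static) partitioning as the compute loop,
    // so every thread first-touches exactly the index range it will later use.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
        c[i] = 0.0;
    }

    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    std::printf("c[0] = %f\n", c[0]);   // keep the result live

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}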

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow