Question

I've read some other questions on this topic, but none of them solved my problem.

I wrote the following code, and both the pthread version and the omp version turned out to be slower than the serial version. I'm very confused.

Compilation environment:

Ubuntu 12.04 64bit 3.2.0-60-generic
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1

CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Vendor ID:             AuthenticAMD
CPU family:            18
Model:                 1
Stepping:              0
CPU MHz:               800.000
BogoMIPS:              3593.36
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
NUMA node0 CPU(s):     0,1

Compile command:

g++ -std=c++11 ./eg001.cpp -fopenmp

#include <cmath>
#include <cstdio>
#include <ctime>
#include <omp.h>
#include <pthread.h>

#define NUM_THREADS 5
const int sizen = 256000000;

struct Data {
    double * pSinTable;
    long tid;
};

void * compute(void * p) {
    Data * pDt = (Data *)p;
    const int start = sizen * pDt->tid/NUM_THREADS;
    const int end = sizen * (pDt->tid + 1)/NUM_THREADS;
    for(int n = start; n < end; ++n) {
        pDt->pSinTable[n] = std::sin(2 * M_PI * n / sizen);
    }
    pthread_exit(nullptr);
}

int main()
{
    double * sinTable = new double[sizen];
    pthread_t threads[NUM_THREADS];
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    clock_t start, finish;

    start = clock();
    int rc;
    Data dt[NUM_THREADS];
    for(int i = 0; i < NUM_THREADS; ++i) {
        dt[i].pSinTable = sinTable;
        dt[i].tid = i;
        rc = pthread_create(&threads[i], &attr, compute, &dt[i]);
    }//for
    pthread_attr_destroy(&attr);
    for(int i = 0; i < NUM_THREADS; ++i) {
        rc = pthread_join(threads[i], nullptr);
    }//for
    finish = clock();
    printf("from pthread: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;
    sinTable = new double[sizen];

    start = clock();
#   pragma omp parallel for
    for(int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from omp: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;
    sinTable = new double[sizen];

    start = clock();
    for(int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from serial: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;

    pthread_exit(nullptr);
    return 0;
}

Output:

from pthread: 21.150000
from omp: 20.940000
from serial: 20.800000

I wondered whether the problem was in my code, so I used pthread to do the same thing.

However, the pthread version was just as slow, and now I wonder whether it might be a problem with OpenMP/pthread on Ubuntu.

A friend of mine with an AMD CPU and Ubuntu 12.04 as well gets the same result, so I have some reason to believe that the problem is not limited to my machine.

If anyone has the same problem as me, or has a clue about it, thanks in advance.


In case my code is not a good enough test, I also ran a benchmark and pasted the results here:

http://pastebin.com/RquLPREc

The benchmark url: http://www.cs.kent.edu/~farrell/mc08/lectures/progs/openmp/microBenchmarks/src/download.html


New information:

I ran the code on Windows (without the pthread version) with VS2012.

I used 1/10 of sizen because Windows does not allow me to allocate such a large chunk of memory. The results are:

from omp: 1.004
from serial: 1.420
from FreeNickName: 735 (this is the improvement suggested by @FreeNickName)

Does this indicate that it could be a problem with the Ubuntu OS?



The problem is solved by using the omp_get_wtime function, which is portable across operating systems. See the answer by Hristo Iliev.


Some tests on the controversial suggestion by FreeNickName.

(Sorry, I have to test it on Ubuntu, because the Windows machine was a friend's.)

--1-- Changed from delete to delete[] (but without memset) (-std=c++11 -fopenmp)

from pthread: 13.491405
from omp: 13.023099
from serial: 20.665132
from FreeNickName: 12.022501

--2-- With memset immediately after new: (-std=c++11 -fopenmp)

from pthread: 13.996505
from omp: 13.192444
from serial: 19.882127
from FreeNickName: 12.541723

--3-- With memset immediately after new: (-std=c++11 -fopenmp -march=native -O2)

from pthread: 11.886978
from omp: 11.351801
from serial: 17.002865
from FreeNickName: 11.198779

--4-- With memset immediately after new, and with FreeNickName's version placed before the OMP for version: (-std=c++11 -fopenmp -march=native -O2)

from pthread: 11.831127
from FreeNickName: 11.571595
from omp: 11.932814
from serial: 16.976979

--5-- With memset immediately after new, with FreeNickName's version placed before the OMP for version, and with NUM_THREADS set to 5 instead of 2 (I have a dual-core CPU).

from pthread: 9.451775
from FreeNickName: 9.385366
from omp: 11.854656
from serial: 16.960101
Was it helpful?

Solution

There is nothing wrong with OpenMP in your case. What is wrong is the way you measure the elapsed time.

Using clock() to measure the performance of multithreaded applications on Linux (and most other Unix-like OSes) is a mistake since it does not return the wall-clock (real) time but instead the accumulated CPU time for all process threads (and on some Unix flavours even the accumulated CPU time for all child processes). Your parallel code shows better performance on Windows since there clock() returns the real time and not the accumulated CPU time.

The best way to prevent such discrepancies is to use the portable OpenMP timer routine omp_get_wtime():

double start = omp_get_wtime();
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
double finish = omp_get_wtime();
printf("from omp: %lf\n", finish - start);

For non-OpenMP applications, you should use clock_gettime() with the CLOCK_REALTIME clock:

struct timespec start, finish;
clock_gettime(CLOCK_REALTIME, &start);
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
clock_gettime(CLOCK_REALTIME, &finish);
printf("from omp: %lf\n", (finish.tv_sec + 1.e-9 * finish.tv_nsec) -
                          (start.tv_sec + 1.e-9 * start.tv_nsec));

Other tips

In the absence of any other information, the Linux scheduler tends to schedule the threads of a process on the same core, so that they are served by the same cache and memory bus. It has no way of knowing that your threads access different memory, and therefore cannot tell that they would be helped rather than hurt by running on different cores.

Use the sched_setaffinity function to set each thread to a different core mask.
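
Below is a minimal sketch of the idea, under the assumption that each worker pins itself at the start of compute(). It uses pthread_setaffinity_np(), the pthreads counterpart of the sched_setaffinity() call mentioned above; the pin_to_core() helper and the tid-to-core mapping are illustrative, not part of the original code.

#include <pthread.h>
#include <sched.h>

// Hypothetical helper: restrict the calling thread to a single core.
// Needs _GNU_SOURCE, which g++ defines by default on glibc systems.
// In the question's code it could be called at the top of compute(),
// e.g. pin_to_core(pDt->tid % 2) on the dual-core machine described above.
static void pin_to_core(int core) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    // pthread_self() is the calling thread; on Linux, sched_setaffinity(0, ...)
    // would have the same effect for that thread.
    pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
}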

WARNING: this answer is controversial. The trick described below is implementation dependent and can lead to a decrease in performance. Still, it might increase it as well. I strongly recommend taking a look at the comments on this answer.


This doesn't really answer the question, but if you alter the way you parallelize your code, you might get a performance boost. Now you do it like this:

#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);

In this case each thread will compute one item. Since you have 2 cores, OpenMP will create two threads by default. To calculate each value a thread would have to:

  1. Initialize.
  2. Compute values.

The first step is rather expensive, and each of your two threads would have to do it sizen/2 times. Try the following instead:

int workloadPerThread = sizen / NUM_THREADS;
#pragma omp parallel for
for (int thread = 0; thread < NUM_THREADS; ++thread)
{
    int start = thread * workloadPerThread;
    int stop = start + workloadPerThread;
    if (thread == NUM_THREADS - 1)
        stop += sizen % NUM_THREADS;
    for (int n = start; n < stop; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
}

This way your threads will initialize only once.
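
For comparison, a similar chunking can be requested from OpenMP itself through the schedule clause. This is only a sketch of the standard alternative, not part of the suggestion above, and on most implementations schedule(static) is already the default for such a loop:

// Ask OpenMP to hand each thread one contiguous chunk of iterations.
#pragma omp parallel for schedule(static)
for (int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);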

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow