Question

I am attempting to benchmark the addition of two 2D arrays on my 2.4 GHz Intel Core 2 Duo CPU in C++. I am summing the arrays over and over so the problem becomes z=x+y+y+y+... where z, x, and y are all 2D arrays. To obtain lots of measurements for this problem, I am looping over the number of times y is added as well as the size of the arrays. Below is the log generated from running my code on my CPU.

Array Size: 500
Iterations: 2
n1: 500
n2: 501
count: 750
Time: 5.00391
75.0913 MegaFLOPS

Iterations: 4
n1: 500
n2: 501
count: 589
Time: 5.00125
118.006 MegaFLOPS

Iterations: 8
n1: 500
n2: 501
count: 343
Time: 5.00967 
137.209 MegaFLOPS

Iterations: 16
n1: 500
n2: 501
count: 185
Time: 5.00164
148.247 MegaFLOPS

Iterations: 32
n1: 500
n2: 501
count: 92
Time: 5.03487
146.473 MegaFLOPS

Iterations: 64
n1: 500
n2: 501
count: 48
Time: 5.01763
153.366 MegaFLOPS

Iterations: 128
n1: 500
n2: 501
count: 25
Time: 5.02799
159.428 MegaFLOPS

Iterations: 256
n1: 500
n2: 501
count: 13
Time: 5.16209 
161.497 MegaFLOPS

Iterations: 512
n1: 500
n2: 501   
count: 7
Time: 5.65551
158.747 MegaFLOPS

Each measurement runs for 5 seconds (Time). The first array size is 500x501, and count is the number of times the sum completes within the 5-second window.

It appears to me that the number of FLOPS calculated is quite low. Below I include the code I use for benchmarking. In my actual program, this loop is included inside another loop which iterates over the array size (n1 and n2) and the iterations (iters).

Stopwatch sw;
int maxTime = 5;
int count = 0;
sw.restart();
while (sw.getTime() < maxTime){
    for(int x = 0; x < n1; x++){
        for(int y = 0; y < n2; y++){
            array3[x][y] = array2[x][y] + array1[x][y];
            for(int k = 0; k < iters; k++){
                array3[x][y] += array2[x][y];
            }
        }
    }
    count++;
}
sw.stop();


std::cout << "n1: " << n1 << std::endl;
std::cout << "n2: " << n2 << std::endl;
std::cout << "count: " << count << std::endl;
std::cout << "Time: " << sw.getTime() << std::endl;

float mflops = (float)(n1*n2*count*iters*1.0e-06/sw.getTime());
std::cout << mflops << " MegaFLOPS" << std::endl;

With Java I can achieve nearly a GigaFLOP so I am confused as to why it is so slow for my C++ program.

Any help will be greatly appreciated.

EDIT:

Here is the code I used to create my performance counter ("stopwatch"):

Stopwatch::Stopwatch(){
    _running=false;
    _start=0;
    _time=0;
}

void Stopwatch::start() {
    if (!_running) {
        gettimeofday(&begtime,NULL);
        _running = true;
        _start = begtime.tv_sec + begtime.tv_usec/1.0e6;
    }
}

void Stopwatch::stop() {
    if (_running) {
        gettimeofday(&endtime,NULL);
        _time += endtime.tv_sec + endtime.tv_usec/1.0e6 - _start;
        _running = false;
    }
}

void Stopwatch::reset() {
    stop();
    _time=0;
}

void Stopwatch::restart() {
    reset();
    start();
}

double Stopwatch::getTime() {
    if (_running) {
        gettimeofday(&nowtime,NULL);
        return nowtime.tv_sec + nowtime.tv_usec/1.0e6 - _start;
    }
    return _time;
}

Solution

I just ran that on my Core 2 Duo with 64-bit Ubuntu. Your measured MFLOPS appear to be for an unoptimised build (I got 133 MFLOPS). Compiling with -O3 produced 1600 MFLOPS, but only because the results are not used. Including one result value in the print statement led to 530 to 630 MFLOPS. This PC, however, needs maximum CPU MHz selected in its Power Saving options; with that set, it produced a steady 789 MFLOPS. A 32-bit compilation would give different results.
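To illustrate the point about unused results, here is a minimal sketch of a kernel that returns a checksum, so the caller can print it and an optimizing compiler cannot discard the additions as dead code. The names add_pass and mflops are hypothetical and not from the question; the MFLOPS formula follows the question's own convention of counting iters additions per element per pass. Note that the extra checksum addition is not counted, so treat this as a sketch rather than a calibrated benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one pass of z = x + y + y + ... (iters extra adds of y).
// Returns a checksum of z; printing it keeps the computation observable.
double add_pass(const std::vector<double>& x, const std::vector<double>& y,
                std::vector<double>& z, int iters) {
    assert(x.size() == y.size() && y.size() == z.size());
    double checksum = 0.0;
    for (std::size_t i = 0; i < z.size(); ++i) {
        double acc = x[i] + y[i];
        for (int k = 0; k < iters; ++k)
            acc += y[i];
        z[i] = acc;
        checksum += acc;
    }
    return checksum;
}

// MFLOPS with the question's operation count. Starting the product with a
// double keeps every multiplication in floating point, so it cannot
// overflow a 32-bit int for large count or iters.
double mflops(int n1, int n2, int count, int iters, double seconds) {
    assert(seconds > 0.0);
    return 1.0e-6 * n1 * n2 * count * iters / seconds;
}
```

Plugging in the first log entry (n1 = 500, n2 = 501, count = 750, iters = 2, Time = 5.00391) reproduces the reported figure of about 75.09 MFLOPS.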

Other tips

I took the liberty of rewriting your code a tiny bit in the hope of giving a slightly better idea of what you can hope to accomplish. Mostly I set the code to run for a fixed number of iterations:

for (int i = 0; i < 10000; i++) {
    for (int x = 0; x < n1; x++){
        for (int y = 0; y < n2; y++){
            array3[x][y] = array2[x][y] + array1[x][y];
            for (int k = 0; k < iters; k++)
                array3[x][y] += array2[x][y];           
        }
    }
    ++count;
}

That may not immediately seem like a good thing, but I wanted to use OpenMP to run the code in parallel, and it can only execute a counted loop in parallel. To enable it, I added this line before the loops above:

#pragma omp parallel for reduction(+:count)

Then I added -openmp when compiling the code (the flag name is compiler-specific; GCC and Clang use -fopenmp), and voilà, the code suddenly runs in parallel on all the available cores. On my ancient desktop (2.6 GHz Athlon 64X2), that got the reported speed up to around 1400 megaFLOPS (vs. 1060 megaFLOPS without OpenMP).
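One thing to be aware of with the pragma on the outer repetition loop is that every thread then writes the whole of array3. An alternative, sketched below under my own assumptions, is to parallelize across rows so each thread writes a disjoint set of rows. The function name add_rows_parallel and the use of std::vector are hypothetical and only for illustration; whether this is faster than the version above depends on the machine and array size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical variant: split the work by row instead of by repetition.
// If the compiler is not given an OpenMP flag, the pragma is ignored and
// the loop simply runs serially with the same result.
void add_rows_parallel(const std::vector<std::vector<double>>& a,
                       const std::vector<std::vector<double>>& b,
                       std::vector<std::vector<double>>& z, int iters) {
    assert(a.size() == b.size() && b.size() == z.size());
    const int rows = static_cast<int>(z.size());
#pragma omp parallel for
    for (int r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < z[r].size(); ++c) {
            double acc = b[r][c] + a[r][c];
            for (int k = 0; k < iters; ++k)
                acc += b[r][c];
            z[r][c] = acc;
        }
    }
}
```

The row index is a signed int because older OpenMP implementations only accept signed loop counters.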

On my laptop (Intel i7-3630QM) it hits around 9000 megaFLOPS (but it's thermally limited, so the speed depends on how many iterations it runs--run it too long and it throttles back to around 7800 megaFLOPS). Even running on a single core, it still manages a little over 2800 megaFLOPS.

FWIW, full source code of the version I tested:

#include <time.h>
#include <iostream>
#include <stdlib.h>

class Stopwatch {
    clock_t start_;
public:
    Stopwatch() : start_(clock()) {}
    double stop() { return double(clock()-start_) / CLOCKS_PER_SEC; }
};

int main() {
    static const int n1 = 500;
    static const int n2 = 501;
    static double array1[n1][n2], array2[n1][n2], array3[n1][n2];

    for (int i = 0; i < n1; i++) {
        for (int j = 0; j < n2; j++) {
            array1[i][j] = 1.0 / rand();
            array2[i][j] = 1.0 / rand();
        }
    }

    int iters = 7;

    int count = 0;
    Stopwatch sw;

#pragma omp parallel for reduction(+:count)
    for (int i = 0; i < 10000; i++) {
        for (int x = 0; x < n1; x++){
            for (int y = 0; y < n2; y++){
                array3[x][y] = array2[x][y] + array1[x][y];
                for (int k = 0; k < iters; k++)
                    array3[x][y] += array2[x][y];           
            }
        }
        ++count;
    }
    double t = sw.stop();

    std::cout << "ignore:";
    for (int i = 0; i < 10; i++)
        std::cout << array3[rand() % n1][rand() % n2] << "\t";
    std::cout << "\nQuit ignoring\n";

    std::cout << "n1: " << n1 << std::endl;
    std::cout << "n2: " << n2 << std::endl;
    std::cout << "count: " << count << std::endl;
    std::cout << "iters: " << iters << std::endl;
    std::cout << "Time: " << t << std::endl;


    double ops = 1.0e-6 * n1 * n2 * count * iters;
    double mflops = ops / t;
    std::cout << mflops << " MegaFLOPS" << std::endl;
}
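A caveat on the timing in the listing above: clock() reports processor time, and on Linux and other POSIX systems that is typically CPU time summed over all threads of the process, which can hide an OpenMP speedup (Microsoft's runtime reports elapsed wall-clock time instead). A minimal sketch of a wall-clock alternative using std::chrono::steady_clock follows; the class name WallStopwatch is hypothetical, and it is meant as a drop-in for the Stopwatch class above rather than a definitive implementation.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical replacement for Stopwatch: measures elapsed wall-clock time,
// independent of how many threads were busy in between.
class WallStopwatch {
    std::chrono::steady_clock::time_point start_;
public:
    WallStopwatch() : start_(std::chrono::steady_clock::now()) {}

    // Seconds elapsed since construction.
    double stop() const {
        std::chrono::duration<double> d =
            std::chrono::steady_clock::now() - start_;
        assert(d.count() >= 0.0);  // steady_clock is monotonic
        return d.count();
    }
};
```

Replacing `Stopwatch sw;` with `WallStopwatch sw;` in main is the only change needed, since stop() has the same signature.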
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow