Low FLOPS measurement for adding two 2D arrays

Question 1

I just ran that on my Core 2 Duo under 64-bit Ubuntu. Your measured MFLOPS appear to be for an unoptimised build (I got 133 MFLOPS). Compiling with -O3 reported 1600 teraflops, which is not a real measurement: the results are never used, so the compiler is free to remove the calculations. Including one result in the print statement led to 530 to 630 MFLOPS. This PC also needs maximum CPU MHz selected in the Power Saving options; with that set, it produced a steady 789 MFLOPS. A 32-bit compilation would give different results.
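To illustrate the "results are not used" point, here is a minimal sketch of one way to keep the additions alive under -O3: fold every result into a value that is returned and later printed. The function name add_and_checksum and the flat std::vector layout are assumptions made for illustration, not code from the question.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the question): adds two equally sized arrays
// element by element and returns the sum of all results.
double add_and_checksum(const std::vector<double>& a,
                        const std::vector<double>& b,
                        std::vector<double>& out) {
    double checksum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];  // the floating-point add being measured
        checksum += out[i];    // every result now feeds the return value
    }
    return checksum;
}
```

Printing the returned checksum after stopping the timer gives the optimiser a reason to keep the loop. Note that the checksum costs one extra addition per element, so the FLOP count should either include it or the figure should be read as slightly conservative.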

Question 2

I took the liberty of rewriting your code a tiny bit in the hope of giving a slightly better idea of what you can hope to accomplish. Mostly I set the code to run for a fixed number of iterations:

for (int i = 0; i < 10000; i++) {
    for (int x = 0; x < n1; x++){
        for (int y = 0; y < n2; y++){
            array3[x][y] = array2[x][y] + array1[x][y];
            for (int k = 0; k < iters; k++)
                array3[x][y] += array2[x][y];           
        }
    }
    ++count;
}

That may not immediately seem like a good thing, but I wanted to use OpenMP to run the code in parallel, and its parallel for construct only handles a counted loop (one whose iteration count is known when the loop starts). To enable it, I added this line before the loops above:

#pragma omp parallel for reduction(+:count)

Then I added -openmp when compiling the code (the exact flag depends on the compiler: GCC and Clang use -fopenmp, MSVC uses /openmp), and voila, the code suddenly runs in parallel on all the available cores. On my ancient desktop (2.6 GHz Athlon 64 X2), that got the reported speed up to around 1400 megaFLOPS (vs. 1060 megaFLOPS without OpenMP).
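One thing to watch in the loop above: the pragma splits the i iterations across threads, and every i iteration writes the same array3 elements, so several threads store to the same locations at once. The values written are identical on every pass, but formally that is a data race. A minimal sketch of one alternative, assuming it is acceptable to parallelise over rows instead, so each element is written by exactly one thread. The function run_passes and its flat std::vector layout are hypothetical and are not the version I timed:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical variant: `reps` passes of the same kernel, with the rows (x)
// of each pass divided among threads. Returns the number of passes completed.
int run_passes(const std::vector<double>& a, const std::vector<double>& b,
               std::vector<double>& out, int n1, int n2, int iters, int reps) {
    int count = 0;
    for (int i = 0; i < reps; i++) {
        // Each thread handles its own set of rows, so no element of `out`
        // is written by two threads. Ignored if OpenMP is not enabled.
#pragma omp parallel for
        for (int x = 0; x < n1; x++) {
            for (int y = 0; y < n2; y++) {
                const std::size_t idx = static_cast<std::size_t>(x) * n2 + y;
                double v = b[idx] + a[idx];
                for (int k = 0; k < iters; k++)
                    v += b[idx];
                out[idx] = v;
            }
        }
        ++count;  // serial code, so no reduction clause is needed here
    }
    return count;
}
```

This opens a parallel region once per pass, which adds some overhead, so timings from it are not directly comparable with the figures quoted here.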

On my laptop (Intel i7-3630QM) it hits around 9000 megaFLOPS, but it's thermally limited, so the speed depends on how long it runs: run it too long and it throttles back to around 7800 megaFLOPS. Even running on a single core, it still manages a little over 2800 megaFLOPS.

FWIW, full source code of the version I tested:

#include <time.h>
#include <iostream>
#include <stdlib.h>

class Stopwatch {
    clock_t start_;
public:
    Stopwatch() : start_(clock()) {}
    double stop() { return double(clock()-start_) / CLOCKS_PER_SEC; }
};

int main() {
    static const int n1 = 500;
    static const int n2 = 501;
    static double array1[n1][n2], array2[n1][n2], array3[n1][n2];

    for (int i = 0; i < n1; i++) {
        for (int j = 0; j < n2; j++) {
            array1[i][j] = 1.0 / rand();
            array2[i][j] = 1.0 / rand();
        }
    }

    int iters = 7;

    int count = 0;
    Stopwatch sw;

#pragma omp parallel for reduction(+:count)
    for (int i = 0; i < 10000; i++) {
        for (int x = 0; x < n1; x++){
            for (int y = 0; y < n2; y++){
                array3[x][y] = array2[x][y] + array1[x][y];
                for (int k = 0; k < iters; k++)
                    array3[x][y] += array2[x][y];           
            }
        }
        ++count;
    }
    double t = sw.stop();

    std::cout << "ignore:";
    for (int i = 0; i < 10; i++)
        std::cout << array3[rand() % n1][rand() % n2] << "\t";
    std::cout << "\nQuit ignoring\n";

    std::cout << "n1: " << n1 << std::endl;
    std::cout << "n2: " << n2 << std::endl;
    std::cout << "count: " << count << std::endl;
    std::cout << "iters: " << iters << std::endl;
    std::cout << "Time: " << t << std::endl;


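    // Millions of operations. Counts only the `iters` additions in the inner
    // loop; the initial array2 + array1 add per element is not included.
    double ops = 1.0e-6 * n1 * n2 * count * iters;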
    double mflops = ops / t;
    std::cout << mflops << " MegaFLOPS" << std::endl;
}
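A caveat on the timer: clock() is specified to return processor time, and on platforms where that is summed over all threads (Linux with glibc, for example) an OpenMP run can appear no faster than a single-threaded one. A minimal sketch of a wall-clock replacement, assuming a C++11 compiler; the class name WallStopwatch is hypothetical and was not part of the code I tested:

```cpp
#include <chrono>

// Hypothetical drop-in for Stopwatch that measures elapsed wall-clock time
// with a monotonic clock instead of processor time.
class WallStopwatch {
    std::chrono::steady_clock::time_point start_;
public:
    WallStopwatch() : start_(std::chrono::steady_clock::now()) {}

    // Seconds elapsed since construction.
    double stop() const {
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - start_).count();
    }
};
```

Swapping `Stopwatch sw;` for `WallStopwatch sw;` in main() leaves the rest of the program unchanged, since both expose the same stop() call.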