I am attempting to obtain an average MFLOPS rate over many iterations of the cblas_dgemm function from the Accelerate framework on Mac OS X. This is the code I am using (it calls cblas_dgemm through the function pointer afp):
double benchmark_cblas_matmul(dgemm_fp afp,
                              const CBLAS_ORDER Order,
                              const CBLAS_TRANSPOSE TransA,
                              const CBLAS_TRANSPOSE TransB,
                              const int M,
                              const int N,
                              const int K,
                              const double alpha,
                              const double *A,
                              const int lda,
                              const double *B,
                              const int ldb,
                              const double beta,
                              double *C,
                              const int ldc)
{
    double mflops_s, seconds = -1.0;
    /* Double the iteration count until the timed region lasts long enough. */
    for (int n_iterations = 1; seconds < 0.1; n_iterations *= 2)
    {
        seconds = read_timer();
        for (int i = 0; i < n_iterations; ++i)
        {
            (*afp)(Order, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);
        }
        seconds = read_timer() - seconds;
        /* 2*M*N*K flops per dgemm call; 1e-6 converts to MFLOPs. */
        mflops_s = (2e-6 * n_iterations * M) * N * K / seconds;
    }
    return mflops_s;
}
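For the square case benchmarked here (M = N = K = 1000), the flop count per call is 2·M·N·K = 2·N³ ≈ 2×10⁹. A tiny standalone check of the rate arithmetic (the helper name dgemm_mflops and the sample timings are hypothetical, purely for illustration):

```c
/* Hypothetical helper mirroring the rate formula in the benchmark above:
   2*M*N*K flops per dgemm call, scaled by 1e-6 to megaflops. */
double dgemm_mflops(int M, int N, int K, int n_iterations, double seconds)
{
    return (2e-6 * n_iterations * M) * N * K / seconds;
}
```

So a single 1000x1000 multiply that takes 0.1 s comes out to 20000 MFLOPS, matching the low end of the rates reported below.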
The timer routine is:
double read_timer(void)
{
    /* Returns wall-clock seconds elapsed since the first call. */
    static bool initialized = false;
    static struct timeval start;
    struct timeval end;
    if (!initialized)
    {
        gettimeofday(&start, NULL);
        initialized = true;
    }
    gettimeofday(&end, NULL);
    return (end.tv_sec - start.tv_sec) + 1.0e-6 * (end.tv_usec - start.tv_usec);
}
The code typically multiplies two 1000x1000 matrices. My problem is that consecutive timings of this code are extremely unreliable: even when the time limit in the outer loop is raised to five seconds, the final rate varies between 20000 and 30000 MFLOPS. I am on a 2011 MacBook Pro running OS X 10.8.2, with a quad-core i5, hyperthreading turned off via this kernel extension, and no applications running except Terminal while I benchmark. Does anyone have a suggestion for how to obtain more stable timings?