I am attempting to obtain an average MFLOPS rate over many iterations of the cblas_dgemm function from the Accelerate framework on Mac OS X. This is the code I am using (it calls cblas_dgemm through the function pointer afp):
double benchmark_cblas_matmul(dgemm_fp afp,
                              const CBLAS_ORDER Order,
                              const CBLAS_TRANSPOSE TransA,
                              const CBLAS_TRANSPOSE TransB,
                              const int M,
                              const int N,
                              const int K,
                              const double alpha,
                              const double *A,
                              const int lda,
                              const double *B,
                              const int ldb,
                              const double beta,
                              double *C,
                              const int ldc)
{
    double mflops_s, seconds = -1.0;
    /* Double the iteration count until the timed region lasts long enough. */
    for (int n_iterations = 1; seconds < 0.1; n_iterations *= 2)
    {
        seconds = read_timer();
        for (int i = 0; i < n_iterations; ++i)
        {
            (*afp)(Order, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);
        }
        seconds = read_timer() - seconds;
        /* 2*M*N*K flops per dgemm call; 1e-6 converts to MFLOPs. */
        mflops_s = (2e-6 * n_iterations * M) * N * K / seconds;
    }
    return mflops_s;
}
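For the square case benchmarked here (M = N = K = 1000), the flop count per call is 2·M·N·K = 2·N³ ≈ 2×10⁹. A tiny standalone check of the rate arithmetic (the helper name dgemm_mflops and the sample timings are hypothetical, purely for illustration):

```c
/* Hypothetical helper mirroring the rate formula in the benchmark above:
   2*M*N*K flops per dgemm call, scaled by 1e-6 to megaflops. */
double dgemm_mflops(int M, int N, int K, int n_iterations, double seconds)
{
    return (2e-6 * n_iterations * M) * N * K / seconds;
}
```

So a single 1000x1000 multiply that takes 0.1 s comes out to 20000 MFLOPS, matching the low end of the rates reported below.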
The timer routine is:
double read_timer(void)
{
    /* Returns wall-clock seconds elapsed since the first call. */
    static bool initialized = false;
    static struct timeval start;
    struct timeval end;
    if (!initialized)
    {
        gettimeofday(&start, NULL);
        initialized = true;
    }
    gettimeofday(&end, NULL);
    return (end.tv_sec - start.tv_sec) + 1.0e-6 * (end.tv_usec - start.tv_usec);
}
The code typically multiplies two 1000x1000 matrices. My problem is that consecutive timings of this code are extremely unreliable: even when the time limit in the outer loop is raised to five seconds, the final rate varies between 20000 and 30000 MFLOPS. I am on a 2011 MacBook Pro running OS X 10.8.2, with a quad-core i5, hyperthreading turned off via this kernel extension, and no applications running except Terminal while I benchmark. Does anyone have a suggestion for how to obtain more stable timings?