Just exchanging
for(j=0;j<NUM;j++)
for(k=0;k<NUM;k++)
with
for(k=0;k<NUM;k++)
for(j=0;j<NUM;j++)
I have had a 43x speedup. Like you said, improving cache locality.
Some milliseconds can even be shaven by blocking, i.e., exchanging
for(j=0;j<NUM;j++)
for(j=0;j<NUM;j++)
for(k=0;k<NUM;k++)
by
for(int i0=0; i0<NUM; i0+=BLK)
for(int k0=0; k0<NUM; k0+=BLK)
for(int j0=0; j0<NUM; j0+=BLK)
for(int i=i0, ix=i0+BLK; i<ix; ++i)
for(int k=k0, kx=k0+BLK; k<kx; ++k)
for(int j=j0, jx=j0+BLK; j<jx; ++j)
(my best runs were with #define BLK 256
, but YMMV).
CLARIFYING: this is the answer given the link referenced by @Adriano, and you really should have looked it before editing the question.