See section 9.11 "Explicit cache control" in Agner Fog's manual Optimizing software in C++. In particular, look at Example 9.6b on page 99 (posted below), which shows how to write to memory without first reading the cache line by using the _mm_stream_pi intrinsic. I have not tried it myself yet, but it's worth looking into. This helps when the matrix size is a multiple of the critical stride. The better solution is probably to change your code and use loop tiling (see Example 9.5b), but since you said you did not want to change the structure of the loop, using _mm_stream_pi may be the best option.
// From Agner Fog's manual optimizing cpp on page 99
// Example 9.6b
#include "xmmintrin.h" // header for intrinsic functions

// This function stores a double without loading a cache line:
static inline void StoreNTD(double * dest, double const & source) {
    _mm_stream_pi((__m64*)dest, *(__m64*)&source); // MOVNTQ
    _mm_empty(); // EMMS
}

const int SIZE = 512; // number of rows and columns in matrix

// function to transpose and copy matrix
void TransposeCopy(double a[SIZE][SIZE], double b[SIZE][SIZE]) {
    int r, c;
    for (r = 0; r < SIZE; r++) {
        for (c = 0; c < SIZE; c++) {
            StoreNTD(&a[c][r], b[r][c]);
        }
    }
}
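
For comparison, here is a rough sketch of the loop tiling idea mentioned above. This is my own minimal version, not a copy of Example 9.5b from the manual. The function name TransposeCopyTiled and the tile edge TILE = 8 (8 doubles, which is one 64-byte cache line on many CPUs) are assumptions for illustration, so treat the tile size as something to tune and measure on your machine.

```cpp
#include <algorithm>  // std::min

const int SIZE = 512;  // number of rows and columns, same as in the example above
const int TILE = 8;    // hypothetical tile edge: 8 doubles = 64 bytes; tune for your CPU

// Transpose and copy b into a, one TILE x TILE block at a time.
// Each block touches only TILE rows of a and TILE rows of b,
// so the cache lines it loads are reused before they can be evicted.
void TransposeCopyTiled(double a[SIZE][SIZE], double b[SIZE][SIZE]) {
    for (int r0 = 0; r0 < SIZE; r0 += TILE) {
        for (int c0 = 0; c0 < SIZE; c0 += TILE) {
            // Clamp the block bounds so SIZE need not be a multiple of TILE
            const int rEnd = std::min(r0 + TILE, SIZE);
            const int cEnd = std::min(c0 + TILE, SIZE);
            for (int r = r0; r < rEnd; r++) {
                for (int c = c0; c < cEnd; c++) {
                    a[c][r] = b[r][c];
                }
            }
        }
    }
}
```

It is called exactly like TransposeCopy, e.g. TransposeCopyTiled(a, b) with two double[SIZE][SIZE] arrays. Unlike the non-temporal store version, this does change the loop structure, which is why it may not fit your constraint.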