Non-blocking memory write in x86 instructions?

https://stackoverflow.com/questions/20166121

04-08-2022

Question

I am writing some highly optimised code, and one thing has bugged me for quite a while. I have a triple for loop as follows:

 for(int ii = 0; ii < ny; ii++){
     for(int jj = 0; jj < nx; jj++){
        ....some serious calculation....
        for(int kk = 0; kk < CONSTANT; kk++){
            _mm_storeu_ps(&(cells.dir[kk])[ii * nx + jj], result); // Writing result to correct location
        }
     }
 }

cells is just a struct of 9 pointers, each pointing to a large array. The code was originally written in array-of-structs (AoS) form; I manually rewrote the whole thing as a struct of arrays (SoA) so I can use SSE to speed it up. But because of the original structure of the code, the loop above has to write the result to the correct location in a cache-unfriendly way. If I comment that line out, the running time of my whole program drops by more than 40%. Is there a non-blocking memory write instruction for x86 that I can take advantage of, or some other trick I can play with this memory write? Please do not suggest changing the structure of the loop; it would take too much time.

Thanks, Bob

Solution

See section 9.11, "Explicit cache control", in Agner Fog's manual Optimizing software in C++. In particular, look at Example 9.6b (posted below) on page 99, which shows how to write without first reading a cache line, using the _mm_stream_pi intrinsic. I have not tried it myself yet, but it's worth looking into. This helps when the matrix size is a multiple of the critical stride. The better solution is probably to change your code and use loop tiling (see Example 9.5b), but since you said you do not want to change the structure of the loop, _mm_stream_ps, the equivalent non-temporal store for four packed floats, may be the best option.

// From Agner Fog's manual optimizing cpp on page 99
// Example 9.6b
#include "xmmintrin.h" // header for intrinsic functions
// This function stores a double without loading a cache line:
static inline void StoreNTD(double * dest, double const & source) {
    _mm_stream_pi((__m64*)dest, *(__m64*)&source); // MOVNTQ
    _mm_empty(); // EMMS
}
const int SIZE = 512; // number of rows and columns in matrix
// function to transpose and copy matrix
void TransposeCopy(double a[SIZE][SIZE], double b[SIZE][SIZE]) {
    int r, c;
    for (r = 0; r < SIZE; r++) {
        for (c = 0; c < SIZE; c++) {
            StoreNTD(&a[c][r], b[r][c]);
        }
    }
}
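
For the float case in the question, a minimal sketch with _mm_stream_ps might look like the following. It is not tested against the asker's code, and several things in it are assumptions: the Cells struct and the names store_result_nt and fill_grid are hypothetical stand-ins, jj is assumed to advance by 4 floats per store, nx is assumed to be a multiple of 4, and every array is assumed to be 16-byte aligned. Unlike _mm_storeu_ps, _mm_stream_ps requires a 16-byte aligned address, so the original unaligned stores cannot simply be swapped one for one unless alignment is guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h> // SSE: _mm_stream_ps, _mm_sfence, _mm_set1_ps

// Hypothetical stand-in for the asker's struct of nine array pointers.
constexpr int NDIR = 9;
struct Cells {
    float* dir[NDIR];
};

// Write one 4-float result into every direction array at element offset idx
// using non-temporal stores (MOVNTPS), which do not pull the destination
// cache line into the cache first.
// Assumption: each dir[kk] + idx is 16-byte aligned.
inline void store_result_nt(Cells& cells, std::size_t idx, __m128 result) {
    for (int kk = 0; kk < NDIR; kk++) {
        float* p = &cells.dir[kk][idx];
        assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
        _mm_stream_ps(p, result);
    }
}

// Same loop shape as in the question, with the "serious calculation"
// replaced by a constant so the sketch stays self-contained.
// Assumption: nx is a multiple of 4 and jj steps over 4 floats at a time.
void fill_grid(Cells& cells, int nx, int ny, float value) {
    const __m128 result = _mm_set1_ps(value);
    for (int ii = 0; ii < ny; ii++) {
        for (int jj = 0; jj < nx; jj += 4) {
            store_result_nt(cells, static_cast<std::size_t>(ii) * nx + jj, result);
        }
    }
    // Non-temporal stores are weakly ordered; fence once after the loop so
    // the data is globally visible before anything reads it back.
    _mm_sfence();
}
```

Whether this actually helps depends on the access pattern: streaming stores go through write-combining buffers, so they tend to pay off when the written data is not read again soon, and they may hurt if the same lines are reused shortly afterwards. Measuring both variants is the only reliable way to tell.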

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow