SSE2: Multiplying signed integers from a 2d array with doubles and summing the results in C

StackOverflow https://stackoverflow.com/questions/23456551

15-07-2023

Question

I am currently trying to vectorize the following piece of code:

velocity[0] = 0.0;
velocity[1] = 0.0;
velocity[2] = 0.0;
for (int i = 0; i < PARAMQ; i++)
{
    velocity[0] += currentCell[i] * LATTICEVELOCITIES[i][0];
    velocity[1] += currentCell[i] * LATTICEVELOCITIES[i][1];
    velocity[2] += currentCell[i] * LATTICEVELOCITIES[i][2];
}

where LATTICEVELOCITIES is a 2D integer array:

 static const int32_t LATTICEVELOCITIES[PARAMQ][3] = {{0, -1, -1},
                                      {-1, 0, -1},
                                      {0, 0, -1},
                                      {1, 0, -1},
                                      {0, 1, -1},
                                      {-1, -1, 0},
                                      {0, -1, 0},
                                      {1, -1, 0},
                                      {-1, 0, 0},
                                      {0, 0, 0},
                                      {1, 0, 0},
                                      {-1, 1, 0},
                                      {0, 1, 0},
                                      {1, 1, 0},
                                      {0, -1, 1},
                                      {-1, 0, 1},
                                      {0, 0, 1},
                                      {1, 0, 1},
                                      {0, 1, 1}
                                     };

and currentCell is an array of doubles.

Unfortunately, all the examples I could find only deal with arrays of the same type, and I don't know how to load only two integers into one 128-bit register and then convert them to doubles.

Thanks for your help in advance.


Solution

Firstly, as per the comments above, I'm going to assume that it's OK to transpose LATTICEVELOCITIES:

static const int32_t LATTICEVELOCITIES[3][PARAMQ] = {
    { 0, -1, 0, 1, 0, -1, 0, 1, -1, 0, 1, -1, 0, 1, 0, -1, 0, 1, 0 },
    { -1, 0, 0, 0, 1, -1, -1, -1, 0, 0, 0, 1, 1, 1, -1, 0, 0, 0, 1 },
    { -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 }
};

Now let's iterate through the data, processing two elements per iteration, and have a single scalar iteration at the end to take care of the last (odd) element:

__m128d v0, v1, v2;
v0 = v1 = v2 = _mm_setzero_pd();
int i;
for (i = 0; i < PARAMQ - 1; i += 2)
{
    __m128d vc, vl0, vl1, vl2;
    __m128i vtemp;

    vc = _mm_loadu_pd(&currentCell[i]);
    /* _mm_loadl_epi64 reads only the two int32s we need, so the final
       iteration cannot read past the end of the last row */
    vtemp = _mm_loadl_epi64((const __m128i *)&LATTICEVELOCITIES[0][i]);
    vl0 = _mm_cvtepi32_pd(vtemp);
    vtemp = _mm_loadl_epi64((const __m128i *)&LATTICEVELOCITIES[1][i]);
    vl1 = _mm_cvtepi32_pd(vtemp);
    vtemp = _mm_loadl_epi64((const __m128i *)&LATTICEVELOCITIES[2][i]);
    vl2 = _mm_cvtepi32_pd(vtemp);
    v0 = _mm_add_pd(v0, _mm_mul_pd(vc, vl0));
    v1 = _mm_add_pd(v1, _mm_mul_pd(vc, vl1));
    v2 = _mm_add_pd(v2, _mm_mul_pd(vc, vl2));
}
/* note: _mm_hadd_pd is SSE3 (pmmintrin.h), not strictly SSE2 */
v0 = _mm_hadd_pd(v0, v0);
v1 = _mm_hadd_pd(v1, v1);
v2 = _mm_hadd_pd(v2, v2);
_mm_store_sd(&velocity[0], v0);
_mm_store_sd(&velocity[1], v1);
_mm_store_sd(&velocity[2], v2);
if (i < PARAMQ)
{
    velocity[0] += currentCell[i] * LATTICEVELOCITIES[0][i];
    velocity[1] += currentCell[i] * LATTICEVELOCITIES[1][i];
    velocity[2] += currentCell[i] * LATTICEVELOCITIES[2][i];
}

Note that this is completely untested code - apologies for typos or bugs that need to be fixed, but the basic idea should be sound.

Note also that you should test and benchmark this against equivalent scalar code - modern CPUs typically have two FPUs so there may not be much to be gained from SSE. If you can assume Sandy Bridge/Ivy Bridge/Haswell or later though, then an AVX/AVX2 implementation should do better.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow