There's no reason why you couldn't put x and y side by side in one __m128, shortening the code somewhat:
// Assumes PARTICLES is even, since each iteration handles two particles.
for (unsigned int i = 0; i < PARTICLES; i += 2) {
    // Particle position/velocity x & y.
    // _mm_setr_ps fills lanes in argument order (lane 0 first);
    // _mm_set_ps takes them in the opposite order, highest lane first.
    __m128 pos = _mm_setr_ps(m_Particle[i]->x, m_Particle[i+1]->x,
                             m_Particle[i]->y, m_Particle[i+1]->y);
    __m128 vel = _mm_setr_ps(m_Particle[i]->vx, m_Particle[i+1]->vx,
                             m_Particle[i]->vy, m_Particle[i+1]->vy);
    float pnew[4];
    _mm_storeu_ps(pnew, _mm_add_ps(pos, vel)); // lanes: x0, x1, y0, y1
    m_Particle[i+0]->x = pnew[0]; // Particle i + 0
    m_Particle[i+0]->y = pnew[2];
    m_Particle[i+1]->x = pnew[1]; // Particle i + 1
    m_Particle[i+1]->y = pnew[3];
}
But really, you've encountered the "Array of structs" vs. "Struct of arrays" issue. SSE code works better with a "Struct of arrays" like:
struct Particles
{
float x[PARTICLES];
float y[PARTICLES];
float xv[PARTICLES];
float yv[PARTICLES];
};
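As a rough sketch of what the update loop could look like with this layout: the function name stepSoA and the particle count are made up for illustration, and unaligned loads/stores are used because nothing here guarantees 16-byte alignment of the arrays.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

constexpr std::size_t PARTICLES = 10;  // assumed count, deliberately not a multiple of 4

struct Particles
{
    float x[PARTICLES];
    float y[PARTICLES];
    float xv[PARTICLES];
    float yv[PARTICLES];
};

// Hypothetical helper: advance every particle one step (position += velocity).
void stepSoA(Particles& p)
{
    std::size_t i = 0;
    // Four particles per iteration, loaded straight from contiguous memory.
    for (; i + 4 <= PARTICLES; i += 4) {
        __m128 x  = _mm_loadu_ps(&p.x[i]);
        __m128 xv = _mm_loadu_ps(&p.xv[i]);
        _mm_storeu_ps(&p.x[i], _mm_add_ps(x, xv));

        __m128 y  = _mm_loadu_ps(&p.y[i]);
        __m128 yv = _mm_loadu_ps(&p.yv[i]);
        _mm_storeu_ps(&p.y[i], _mm_add_ps(y, yv));
    }
    // Scalar tail for the leftover particles when PARTICLES % 4 != 0.
    for (; i < PARTICLES; ++i) {
        p.x[i] += p.xv[i];
        p.y[i] += p.yv[i];
    }
}
```

If the arrays are known to be 16-byte aligned, _mm_load_ps and _mm_store_ps could replace the unaligned variants.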
Another option is a hybrid approach:
struct Particles4
{
__m128 x;
__m128 y;
__m128 xv;
__m128 yv;
};
Particles4 particles[PARTICLES / 4]; // assumes PARTICLES is a multiple of 4
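Either way will give simpler and faster code than your example, because four consecutive values go into a register with a single load instead of being gathered one field at a time.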
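For the hybrid layout, the update might look something like the sketch below. The function name stepHybrid and the count of 8 are assumptions for illustration; each Particles4 holds four particles, so the loop body is just two adds with no explicit loads or stores.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

constexpr std::size_t PARTICLES = 8;  // assumed count, a multiple of 4

struct Particles4
{
    __m128 x;
    __m128 y;
    __m128 xv;
    __m128 yv;
};

// Hypothetical helper: advance every group of four particles one step.
void stepHybrid(Particles4* groups, std::size_t groupCount)
{
    for (std::size_t g = 0; g < groupCount; ++g) {
        groups[g].x = _mm_add_ps(groups[g].x, groups[g].xv);
        groups[g].y = _mm_add_ps(groups[g].y, groups[g].yv);
    }
}
```

Particle n lives in lane n % 4 of group n / 4, so reading a single particle back out takes a store to a float[4] (or similar) first. That is the price of this layout compared with the plain struct of arrays.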