Question

I'm optimizing a piece of code that moves particles on the screen around gravity fields. For this we're told to use SSE. Now after rewriting this little bit of code, I was wondering if there is an easier/smaller way of storing the values back in the array of particles.

Here's the code before:

for (unsigned int i = 0; i < PARTICLES; i++) {
    m_Particle[i]->x += m_Particle[i]->vx;
    m_Particle[i]->y += m_Particle[i]->vy;
}

And here's the code after:

for (unsigned int i = 0; i < PARTICLES; i += 4) {
    // Particle position/velocity x & y
    __m128 ppx4 = _mm_set_ps(m_Particle[i]->x, m_Particle[i+1]->x,
                             m_Particle[i+2]->x, m_Particle[i+3]->x);
    __m128 ppy4 = _mm_set_ps(m_Particle[i]->y, m_Particle[i+1]->y,
                             m_Particle[i+2]->y, m_Particle[i+3]->y);
    __m128 pvx4 = _mm_set_ps(m_Particle[i]->vx, m_Particle[i+1]->vx,
                             m_Particle[i+2]->vx, m_Particle[i+3]->vx);
    __m128 pvy4 = _mm_set_ps(m_Particle[i]->vy, m_Particle[i+1]->vy,
                             m_Particle[i+2]->vy, m_Particle[i+3]->vy);

    union { float newx[4]; __m128 pnx4; };
    union { float newy[4]; __m128 pny4; };
    pnx4 = _mm_add_ps(ppx4, pvx4);
    pny4 = _mm_add_ps(ppy4, pvy4);

    m_Particle[i+0]->x = newx[3]; // Particle i + 0
    m_Particle[i+0]->y = newy[3];
    m_Particle[i+1]->x = newx[2]; // Particle i + 1
    m_Particle[i+1]->y = newy[2];
    m_Particle[i+2]->x = newx[1]; // Particle i + 2
    m_Particle[i+2]->y = newy[1];
    m_Particle[i+3]->x = newx[0]; // Particle i + 3
    m_Particle[i+3]->y = newy[0];
}

It works, but it looks way too large for something as simple as adding a value to another value. Is there a shorter way of doing this without changing the m_Particle structure?

Était-ce utile?

La solution

There's no reason why you couldn't put x and y side by side in one __m128, shortening the code somewhat:

for (unsigned int i = 0; i < PARTICLES; i += 2) {
    // Particle position/velocity x & y
    __m128 pos = _mm_set_ps(m_Particle[i]->x, m_Particle[i+1]->x, 
                            m_Particle[i]->y, m_Particle[i+1]->y);
    __m128 vel = _mm_set_ps(m_Particle[i]->vx, m_Particle[i+1]->vx,
                            m_Particle[i]->vy, m_Particle[i+1]->vy);

    union { float pnew[4]; __m128 pnew4; };
    pnew4 = _mm_add_ps(pos, vel);

    m_Particle[i+0]->x = pnew[0]; // Particle i + 0
    m_Particle[i+0]->y = pnew[2];
    m_Particle[i+1]->x = pnew[1]; // Particle i + 1
    m_Particle[i+1]->y = pnew[3];
}

But really, you've encountered the "Array of structs" vs. "Struct of arrays" issue. SSE code works better with a "Struct of arrays" like:

struct Particles
{
    float x[PARTICLES];
    float y[PARTICLES];
    float xv[PARTICLES];
    float yv[PARTICLES];
};

Another option is a hybrid approach:

struct Particles4
{
    __m128 x;
    __m128 y;
    __m128 xv;
    __m128 yv;
};

Particles4 particles[PARTICLES / 4];

Either way will give simpler and faster code than your example.

Autres conseils

I went a slightly different route to simplify: process 2 elements per iteration and pack them as (x,y,x,y) instead of (x,x,x,x) and (y,y,y,y) as you did.

If in your particle class x and y are contiguous floats and you align fields on 32 bits, a single operation loading a x as a double will in fact load the two floats x and y.

for (unsigned int i = 0; i < PARTICLES; i += 2) {
    __m128 pos = _mm_set1_pd(0); // zero vector
    // I assume x and y are contiguous in memory
    // so loading a double at x loads 2 floats: x and the following y.
           pos = _mm_loadl_pd(pos, (double*)&m_Particle[i  ]->x);
    // a register can contain 4 floats so 2 positions
           pos = _mm_loadh_pd(pos, (double*)&m_Particle[i+1]->x);

    // same for velocities
    __m128 vel = _mm_set1_pd(0);
           vel = _mm_loadl_pd(pos, (double*)&m_Particle[i  ]->vx);
           vel = _mm_loadh_pd(pos, (double*)&m_Particle[i+1]->vy);

    pos = _mm_add_ps(pos, vel); // do the math

    // store the same way as load
    _mm_storel_pd(&m_Particle[i  ]->x, pos);
    _mm_storeh_pd(&m_Particle[i+1]->x, pos);
}

Also, since you mention particle, do you intend to draw them with OpenGL / DirectX ? If so, you could perform this kind of permutation on the GPU faster while also avoiding data transfers from main memory to GPU, so it's a gain on all fronts.

If that's not the case and you intend to stay on the CPU, using an SSE friendly layout like one array for positions and one for velocities could be a solution:

struct particle_data {
    std::vector<float> xys, vxvys;
};

But it would have the drawback of either breaking your architecture or requiring a copy from your current array of structs to a temporary struct of arrays. The compute would be faster but the additional copy might outweigh that. Only benchmarking can show...

A last option is to sacrifice a little performance and load your data as it is, and use SSE shuffle instructions to rearrange the data locally at each iteration. But arguably this would make the code even harder to maintain.

For the design of performance you should avoid handling array of structure but you should work with structure of array.

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top