SSE copy data to variables

Question 1

There's no reason why you couldn't put x and y side by side in one __m128, shortening the code somewhat:

for (unsigned int i = 0; i < PARTICLES; i += 2) {
    // Particle position/velocity x & y
    __m128 pos = _mm_set_ps(m_Particle[i]->x, m_Particle[i+1]->x, 
                            m_Particle[i]->y, m_Particle[i+1]->y);
    __m128 vel = _mm_set_ps(m_Particle[i]->vx, m_Particle[i+1]->vx,
                            m_Particle[i]->vy, m_Particle[i+1]->vy);

    union { float pnew[4]; __m128 pnew4; };
    pnew4 = _mm_add_ps(pos, vel);

    m_Particle[i+0]->x = pnew[0]; // Particle i + 0
    m_Particle[i+0]->y = pnew[2];
    m_Particle[i+1]->x = pnew[1]; // Particle i + 1
    m_Particle[i+1]->y = pnew[3];
}

But really, you've encountered the "Array of structs" vs. "Struct of arrays" issue. SSE code works better with a "Struct of arrays" like:

struct Particles
{
    float x[PARTICLES];
    float y[PARTICLES];
    float xv[PARTICLES];
    float yv[PARTICLES];
};

Another option is a hybrid approach:

struct Particles4
{
    __m128 x;
    __m128 y;
    __m128 xv;
    __m128 yv;
};

Particles4 particles[PARTICLES / 4];

Either way will give simpler and faster code than your example.

Question 2

I went a slightly different route to simplify: process 2 elements per iteration and pack them as (x,y,x,y) instead of (x,x,x,x) and (y,y,y,y) as you did.

If in your particle class x and y are contiguous floats and you align fields on 32 bits, a single operation loading a x as a double will in fact load the two floats x and y.

for (unsigned int i = 0; i < PARTICLES; i += 2) {
    __m128 pos = _mm_set1_pd(0); // zero vector
    // I assume x and y are contiguous in memory
    // so loading a double at x loads 2 floats: x and the following y.
           pos = _mm_loadl_pd(pos, (double*)&m_Particle[i  ]->x);
    // a register can contain 4 floats so 2 positions
           pos = _mm_loadh_pd(pos, (double*)&m_Particle[i+1]->x);

    // same for velocities
    __m128 vel = _mm_set1_pd(0);
           vel = _mm_loadl_pd(pos, (double*)&m_Particle[i  ]->vx);
           vel = _mm_loadh_pd(pos, (double*)&m_Particle[i+1]->vy);

    pos = _mm_add_ps(pos, vel); // do the math

    // store the same way as load
    _mm_storel_pd(&m_Particle[i  ]->x, pos);
    _mm_storeh_pd(&m_Particle[i+1]->x, pos);
}

Also, since you mention particle, do you intend to draw them with OpenGL / DirectX ? If so, you could perform this kind of permutation on the GPU faster while also avoiding data transfers from main memory to GPU, so it's a gain on all fronts.

If that's not the case and you intend to stay on the CPU, using an SSE friendly layout like one array for positions and one for velocities could be a solution:

struct particle_data {
    std::vector<float> xys, vxvys;
};

But it would have the drawback of either breaking your architecture or requiring a copy from your current array of structs to a temporary struct of arrays. The compute would be faster but the additional copy might outweigh that. Only benchmarking can show...

A last option is to sacrifice a little performance and load your data as it is, and use SSE shuffle instructions to rearrange the data locally at each iteration. But arguably this would make the code even harder to maintain.

Question 3

For the design of performance you should avoid handling array of structure but you should work with structure of array.