Question

I am having trouble with the _mm_store_ps intrinsic. I get a segmentation fault when I use it, and I know that line is the cause because the fault goes away when I comment it out. This is strange because I am storing into a static array that I explicitly ask the compiler to align, and switching to _mm_storeu_ps does not make the problem go away. Here is the relevant section of code:

//Directly access array instead of using Boost interface
boost::numeric::ublas::matrix<float>::iterator2 it = result.begin2();
float temp[4] __attribute__((aligned(16))), temp2 = 0;

//Use SSE
__m128 m1, sse_right1, sse_left1, store_sse __attribute__((aligned (16))) = _mm_set_ps1(0);

unsigned k = 0;
//Iterate over the dimensions of the matrices
for (unsigned i = 0; i < ls1; i++)
{   
    for (unsigned j = 0; j < rs2; j++)
    {   
        while (k + 3 < ls2)
        {   
            sse_right1 = _mm_load_ps(arr + k + j * rs1); 
            sse_left1 = _mm_load_ps(left_arr + k + i * ls2);
            m1 = _mm_mul_ps(sse_right1, sse_left1);
            store_sse = _mm_add_ps(store_sse,m1);
            k += 4;
        }   

        //If ls2 isn't divisible by 4
        while (k < ls2)
        {   
            temp2 += left_arr[i * ls2 + k] * arr[k + j * rs1];
            k++;
        }   

        if (ls2 >= 4)
        {   
            _mm_store_ps(temp, store_sse);

            for (unsigned l = 0; l < 4; l++)
            {   
                temp2 += temp[l];
            }   
        }   

        *it = temp2;
        store_sse = _mm_set_ps1(0);
        temp2 = 0;
        k = 0;
        it++;
    }
}

The segmentation fault isn't a problem with the array bounds because the execution makes it down to the _mm_store_ps line. Any help would be appreciated, thanks!

Edit: The problem is with _mm_load_ps, when I use _mm_loadu_ps it runs fine. I am using static arrays as the arguments to _mm_load_ps, so I don't know why I am having problems.

Solution

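The aligned SSE intrinsics (_mm_load_ps and _mm_store_ps) require the address to be a multiple of 16 bytes. If you're not reading outside the array, a misaligned address is likely your problem. Aligning the start of the array is not enough on its own: arr + k + j * rs1 is 16-byte aligned only when k + j * rs1 is a multiple of 4 floats.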
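
One way to confirm this is to check the addresses you pass to the aligned intrinsics. The following is only a sketch: is_aligned16, rows_aligned16 and load_aligned_checked are hypothetical helper names, not part of SSE or Boost, and it assumes a row-major float matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: true if p lies on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: true if every row of a row-major float matrix with
// row_len floats per row starts on a 16-byte boundary.
inline bool rows_aligned16(const float* base, std::size_t row_len, std::size_t rows) {
    for (std::size_t r = 0; r < rows; ++r) {
        if (!is_aligned16(base + r * row_len)) return false;
    }
    return true;
}

// Hypothetical wrapper: aligned load that fails with a readable assertion
// in debug builds instead of a bare segmentation fault.
inline __m128 load_aligned_checked(const float* p) {
    assert(is_aligned16(p) && "address passed to _mm_load_ps is not 16-byte aligned");
    return _mm_load_ps(p);
}
```

With an aligned base, a row length of 5 floats already puts the second row 20 bytes in, which is not a multiple of 16, so rows_aligned16 returns false for it.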

Try using _mm_storeu_ps and _mm_loadu_ps, which are the unaligned versions. They run a little slower, but they work at any address. Once you've verified that alignment is the problem, try aligning the memory in the first place for maximum performance.
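
As a rough illustration of the unaligned approach, here is a minimal dot product sketch. dot_sse_unaligned is a hypothetical name, and the code assumes plain contiguous float arrays rather than the Boost matrices from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: dot product of two float arrays of length n.
// It uses unaligned loads, so a and b may start at any address.
inline float dot_sse_unaligned(const float* a, const float* b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));

    __m128 acc = _mm_setzero_ps();
    std::size_t k = 0;

    // Process four floats per iteration.
    for (; k + 4 <= n; k += 4) {
        __m128 va = _mm_loadu_ps(a + k);
        __m128 vb = _mm_loadu_ps(b + k);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }

    // lanes is explicitly 16-byte aligned, so the aligned store is safe here.
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    // Scalar tail for lengths that are not a multiple of 4.
    for (; k < n; ++k) {
        sum += a[k] * b[k];
    }
    return sum;
}
```

Calling dot_sse_unaligned(a + 1, b + 1, 5) works even though a + 1 is not 16-byte aligned, which is exactly the case where _mm_load_ps would fault.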

Licensed under: CC-BY-SA with attribution