Compiling SSE intrinsics in GCC gives an error

https://stackoverflow.com/questions/22110916

18-10-2022

Question

My SSE code works fine on Windows, but when I build it on Linux I run into many issues. One of them is this:

Here is a simplified illustration of my code:

int main(int ref, int ref_two)
{
    __m128i a, b;

    a.m128i_u8[0] = ref;
    b.m128i_u8[0] = ref_two;
    ...
}

Error 1:

error : request for member 'm128i_u8' in something not a structure or union

The following thread suggests using the appropriate _mm_set_XXX intrinsics instead, because direct member access only works with Microsoft's compiler: SSE intrinsics compiling MSDN code with GCC error?

I tried the approach from that thread and replaced the member accesses with set intrinsics, but it seriously hurts the performance of my application.

My code base is large and would need changes in about 2000 places, so I am looking for a better alternative that does not affect performance.

Recently I found this link from Intel, which says to use the -fms-dialect option when porting from Windows to Linux:

http://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-7A69898B-BDBB-4AA9-9820-E4A590945903.htm

Has anybody tried the above method? Has anybody found a solution for porting a large code base to Linux?

@Paul, here is my code. I placed a timer around both methods to measure the time taken, and the results were shocking.

Code 1: 115 ms (Microsoft method, accessing elements directly)

Code 2: 151 ms (using the set intrinsic)

Using set cost me an extra 36 ms.

NOTE: Replacing a single statement costs 36 ms; imagine the performance degradation if I replace it in 2000 places in my program.

That is why I am looking for a better alternative to the set intrinsic.

Code 1:

__m128i array;
unsigned char* temp_src;
unsigned char* temp_dst;

for (i = 0; i < 20; i++)
{
    for (j = 0; j < 1600; j += 16)
    {
        Timerstart(&x);
        array = _mm_loadu_si128((__m128i *)(src));
        array.m128i_u8[0] = 36;   // MSVC-only member access
        y += Timerstop(x);
        _mm_store_si128((__m128i *)temp_dst, array);
    }
}

Code 2:

__m128i array;
unsigned char* temp_src;
unsigned char* temp_dst;

for (i = 0; i < 20; i++)
{
    for (j = 0; j < 1600; j += 16)
    {
        Timerstart(&x);
        array = _mm_set_epi8(*(src+15), *(src+14), *(src+13), *(src+12),
                             *(src+11), *(src+10), *(src+9),  *(src+8),
                             *(src+7),  *(src+6),  *(src+5),  *(src+4),
                             *(src+3),  *(src+2),  *(src+1),  36);
        y += Timerstop(x);
        _mm_store_si128((__m128i *)temp_dst, array);
    }
}

No correct solution

OTHER TIPS

You're trying to use a non-portable Microsoft-ism. Just stick to the more portable intrinsics, e.g. _mm_set_epi8:

 __m128i a = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ref);

This works with any compiler that supports the SSE2 intrinsics, including GCC, Clang, ICC and MSVC.
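As a minimal sketch of that replacement, assuming SSE2 and a value that fits in one byte: when the other 15 bytes can be zero, the MSVC-only a.m128i_u8[0] = ref can be written with _mm_set_epi8 or, likely more cheaply, with a single _mm_cvtsi32_si128. The helper names first_byte_set, first_byte_cvt and byte_at are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: portable stand-in for "a.m128i_u8[0] = ref"
   when the remaining 15 bytes are allowed to be zero. */
static __m128i first_byte_set(unsigned char ref)
{
    /* The LAST argument of _mm_set_epi8 is element 0. */
    return _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
                        0, 0, 0, 0, 0, 0, 0, (char)ref);
}

/* Same vector via a scalar-to-vector move: the 32-bit value lands in
   the low lane and the upper 96 bits are zeroed. */
static __m128i first_byte_cvt(unsigned char ref)
{
    return _mm_cvtsi32_si128(ref);
}

/* Hypothetical helper: read byte i without relying on m128i_u8. */
static unsigned char byte_at(__m128i v, int i)
{
    unsigned char bytes[16];
    assert(i >= 0 && i < 16);
    _mm_storeu_si128((__m128i *)bytes, v);
    return bytes[i];
}
```

Both helpers should produce the same vector, so byte_at(first_byte_set(200), 0) and byte_at(first_byte_cvt(200), 0) both return 200, and every other byte reads back as 0.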

If you're seeing performance issues, it's probably because you're doing something inefficient inside a loop. Without seeing the actual code, though, it's not possible to make specific suggestions for making it more efficient.

EDIT

Often there are much more efficient ways of loading a vector with a combination of values such as in your example, e.g.:

#include <smmintrin.h> // SSE4.1

for (...)
{
    for (...)
    {
        __m128i v = _mm_loadu_si128((__m128i *)src); // load vector from src..src+15
        v = _mm_insert_epi8(v, 36, 0);               // replace element 0 with constant `36`
        _mm_storeu_si128((__m128i *)dst, v);         // store vector at dst..dst+15
    }
}

This translates to just 3 instructions. (Note: if you can't assume SSE4.1 as a minimum, the _mm_insert_epi8 can be replaced with two bitwise intrinsics; this will still be much more efficient than using _mm_set_epi8.)
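The SSE2-only fallback mentioned in the note could look roughly like this. It is a sketch, not necessarily the exact pair of intrinsics the note has in mind: it assumes an AND with a constant mask that clears byte 0, followed by an OR that fills in the new value. The names replace_byte0_sse2 and patch_block are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 only, no SSE4.1 required */

/* Replace byte 0 of v with `value` using one AND and one OR. */
static __m128i replace_byte0_sse2(__m128i v, unsigned char value)
{
    /* All bits set except in byte 0 (the last argument is element 0). */
    const __m128i keep = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1,
                                      -1, -1, -1, -1, -1, -1, -1, 0);
    /* `value` in byte 0, zeros everywhere else. */
    const __m128i fill = _mm_cvtsi32_si128(value);

    return _mm_or_si128(_mm_and_si128(v, keep), fill);
}

/* Hypothetical wrapper mirroring the loop body: load 16 bytes from src,
   patch byte 0, store 16 bytes to dst. Unaligned pointers are fine. */
static void patch_block(const unsigned char *src, unsigned char *dst,
                        unsigned char value)
{
    assert(src && dst);
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    v = replace_byte0_sse2(v, value);
    _mm_storeu_si128((__m128i *)dst, v);
}
```

With a constant value such as 36, both the mask and the fill vector are loop-invariant, so a compiler can be expected to hoist them out of the inner loop, leaving a load, an AND, an OR and a store per iteration.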

Licensed under: CC-BY-SA with attribution
