Instead of using implicit aligned loads/stores like this:
__m128* a = (__m128*) (data1+1); // <-- +1 (misaligned: address is not a multiple of 16)
__m128* b = (__m128*) (data2);
__m128* c = (__m128*) (data3);
*c = _mm_add_ps(*a, *b); // dereferencing *a emits an aligned load (movaps) and faults
use explicit aligned/unaligned loads/stores as appropriate, e.g.:
__m128 va = _mm_loadu_ps(data1+1); // <-- +1 (NB: use unaligned load)
__m128 vb = _mm_load_ps(data2);
__m128 vc = _mm_add_ps(va, vb);
_mm_store_ps(data3, vc);
This compiles to the same number of instructions, but it won't crash, and you have explicit control over which loads and stores are aligned and which are unaligned.
Note that recent CPUs pay relatively small penalties for unaligned loads, but on older CPUs the hit can be 2x or worse.