Question
I'm trying to super-optimize some code, and a place that I'd like to speed up is the following.
I'd like to take the result of a dot-product operation (_mm_dp_ps), which is an __m128, and get it directly into a register. However, using _mm_store_ps, I'd have to write the full 128 bits to an array and then load the first entry of that array.
Call my __m128 variable "vector".
Can I do float ans = *((float *)&vector)?
If this works, the question remains whether it even helps. Will ans end up in a register, or will I have to load it from L1 cache regardless?
Thank you!!!
Solution
The result is actually already in a register; you simply need to tell the compiler to interpret it as a scalar instead of a vector. You're looking for the _mm_cvtss_f32 intrinsic:
float result = _mm_cvtss_f32(vector_result);
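For context, here is a minimal sketch of how that line might sit next to the dot product from the question. The helper name dot4, the unaligned loads, and the 0xF1 mask are assumptions for illustration and are not from the original post; _mm_dp_ps needs SSE4.1, so either build with -msse4.1 or rely on the target attribute shown (GCC/Clang only).

```c
#include <smmintrin.h>  /* SSE4.1 header, provides _mm_dp_ps */

/* dot4 is a hypothetical helper name used only for this sketch. */
#if defined(__GNUC__) || defined(__clang__)
__attribute__((target("sse4.1")))
#endif
static float dot4(const float *a, const float *b)
{
    /* Unaligned loads, so the inputs need no particular alignment. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    /* Mask 0xF1: the high nibble (0xF) multiplies all four lanes,
       the low nibble (0x1) writes the sum into lane 0 only. */
    __m128 vector_result = _mm_dp_ps(va, vb, 0xF1);

    /* Reinterpret lane 0 as a scalar float; no store to memory is requested. */
    return _mm_cvtss_f32(vector_result);
}
```

Called as dot4(a, b) on two arrays of four floats, this returns the scalar dot product. Whether the value actually stays in a register is up to the compiler, but on typical x86-64 builds _mm_cvtss_f32 compiles to no instruction at all or to a register move.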
OTHER TIPS
Just worth pointing out that if you're only using a single value, you ought to use the ss intrinsics instead of the ps ones where available; in this case, _mm_store_ss is perfectly valid for storing the low value to a single-precision float without needing _mm_cvtss_f32.
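A small sketch of the _mm_store_ss route, assuming a hypothetical helper named store_low that takes plain float pointers (the name and signature are illustrative only). It uses SSE only, so it should build on any x86-64 compiler without extra flags.

```c
#include <xmmintrin.h>  /* SSE header, provides _mm_loadu_ps and _mm_store_ss */

/* store_low is a hypothetical helper name used only for this sketch. */
static void store_low(const float *src, float *dst)
{
    /* Load four floats; src[0] ends up in the lowest lane. */
    __m128 v = _mm_loadu_ps(src);

    /* Write only the low 32 bits of the vector to *dst. */
    _mm_store_ss(dst, v);
}
```

If dst points at a local variable, an optimizing compiler is free to keep the value in a register rather than going through memory, so in practice this and _mm_cvtss_f32 often end up equivalent.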