Question

I'm trying to super-optimize some code, and a place that I'd like to speed up is the following.

I'd like to take the result of a dot-product operation (_mm_dp_ps), which is an __m128, and save it directly into a register. However, using _mm_store_ps would mean writing the full 128 bits to an array and then loading the first entry of that array.

Call my __m128 variable "vector".

Can I do float ans = *((float *)&vector)?

If this works, the question remains whether it even helps. Will ans be loaded into a register, or will I have to load it from L1 regardless?

Thank you!!!


Solution

The result is already in a register; you simply need to tell the compiler to interpret it as a scalar instead of a vector. You're looking for the _mm_cvtss_f32 intrinsic, which returns the lowest element of an __m128 as a float:

float result = _mm_cvtss_f32(vector_result);
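
For context, here is a minimal sketch of how that call might sit at the end of a dot product. The helper name dot4 is hypothetical, and the sketch deliberately swaps _mm_dp_ps (which needs SSE4.1) for an SSE1-only multiply plus shuffle-and-add reduction so that it builds without extra compiler flags. The last line is the relevant part: the sum already sits in the low lane, and _mm_cvtss_f32 hands it back as a float.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_mul_ps, _mm_cvtss_f32, ... */

/* Hypothetical helper: dot product of two 4-float arrays.
 * Uses an SSE1 reduction in place of _mm_dp_ps (an assumption made
 * only so the sketch needs no SSE4.1 support). */
static float dot4(const float *a, const float *b)
{
    __m128 va   = _mm_loadu_ps(a);
    __m128 vb   = _mm_loadu_ps(b);
    __m128 prod = _mm_mul_ps(va, vb);           /* [p0, p1, p2, p3] */

    /* Fold the upper two lanes onto the lower two. */
    __m128 high = _mm_movehl_ps(prod, prod);    /* [p2, p3, p2, p3] */
    __m128 sum2 = _mm_add_ps(prod, high);       /* lane 0 = p0+p2, lane 1 = p1+p3 */

    /* Broadcast lane 1 and add it to lane 0 only. */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 total = _mm_add_ss(sum2, lane1);     /* lane 0 = p0+p1+p2+p3 */

    /* Reinterpret the low lane as a scalar float. */
    return _mm_cvtss_f32(total);
}
```

With _mm_dp_ps the reduction lines would collapse into a single intrinsic call, but the final _mm_cvtss_f32 step would look the same.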

OTHER TIPS

Just worth pointing out that if you only need a single value, you should use the ss intrinsics instead of the ps ones where available; in this case, _mm_store_ss is perfectly valid for storing the low value to a single-precision float without needing _mm_cvtss_f32.
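
A minimal sketch of that alternative, assuming the destination is an ordinary float variable; the helper name store_low_lane is hypothetical and exists only for illustration. _mm_store_ss writes just the lowest 32 bits of the vector and has no alignment requirement on the destination.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_store_ss */

/* Hypothetical helper: write only the lowest lane of v to *out. */
static void store_low_lane(float *out, __m128 v)
{
    _mm_store_ss(out, v);  /* stores 4 bytes, not the full 16 */
}
```

Usage would be along the lines of `float ans; store_low_lane(&ans, vector);`, after which ans holds the low element of the vector.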

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow