Before AVX-512, x86 doesn't have unsigned <-> FP conversion instructions.
(With AVX-512F, see vcvtusi2sd and vcvtsd2usi, and their respective ss versions. Packed SIMD conversions involving unsigned or 64-bit integers are also new with AVX-512; the packed 64-bit-integer ones like vcvtqq2pd need AVX-512DQ. Before that, packed conversions could only go to/from int32_t.)
In 64-bit code, unsigned 32-bit -> FP is easy: just zero-extend u32 to i64 and use signed 64-bit conversion. Every uint32_t value is representable as a non-negative int64_t.
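For the reverse direction, convert FP -> i64 and truncate to u32, if you're OK with what happens for out-of-range FP inputs. You get 0 when the input is out of range for i64, because cvttsd2si returns the "integer indefinite" value 0x8000000000000000, whose low 32 bits are zero. Otherwise you get the low 32 bits of the 2's complement i64 bit pattern.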
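Here is a minimal C sketch of both directions. The helper names u32_to_double and double_to_u32 are hypothetical, chosen for illustration, and the sketch assumes a C99 compiler with <stdint.h>. The comments show which x86-64 instruction each step roughly corresponds to, but the compiler chooses the actual code. One difference from the asm: in C, an out-of-range double -> int64_t conversion is undefined behaviour. This sketch therefore only models inputs whose truncated value fits in int64_t.

```c
#include <stdint.h>

/* u32 -> double: zero-extend to int64_t, then do a signed 64-bit conversion.
   Every uint32_t is a non-negative int64_t, so nothing is lost. */
double u32_to_double(uint32_t x) {
    int64_t wide = (int64_t)x;   /* zero-extension, like mov eax, edi */
    return (double)wide;         /* signed conversion, like cvtsi2sd xmm0, rax */
}

/* double -> u32: truncate toward zero into int64_t, then keep the low 32 bits.
   Assumption: trunc(d) fits in int64_t. Out-of-range is UB in C, whereas
   cvttsd2si would return 0x8000000000000000, whose low half is 0. */
uint32_t double_to_u32(double d) {
    int64_t wide = (int64_t)d;   /* like cvttsd2si rax, xmm0 */
    return (uint32_t)wide;       /* reduces modulo 2^32: the low 32 bits */
}
```

For example, double_to_u32(4294967301.0), which is 2^32 + 5, gives 5, and double_to_u32(-1.0) gives 4294967295. This is the wrapping behaviour of taking the low half of the i64 result.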
u32 -> FP: See @Igor Skochinsky's answer for compiler output. x86-64 GCC and Clang use the same trick as x64 MSVC. The key part is to zero-extend the value to 64-bit and then convert. Writing a 32-bit register implicitly zero-extends to 64-bit. So you may not need the mov r32, r32 if you know the value was written by a 32-bit operation, or if you load it from memory yourself with a 32-bit mov, which also zero-extends.
```asm
; assuming your input starts in EDI, and that RDI might have garbage in the high half
; like a 32-bit function arg.
mov eax, edi ; mov-elimination wouldn't work with edi,edi
vcvtsi2sd xmm0, xmm7, rax ; where XMM7 is some cold register to avoid a false dep
```
Zero-extending into a different register (mov eax, edi rather than mov edi, edi, if you need a separate instruction for zero-extension at all) is motivated by mov-elimination not working in the same,same register case: see Can x86's MOV really be "free"? Why can't I reproduce this at all?
If you don't have AVX, or don't know a not-recently-written register to use, you may want to use pxor xmm0, xmm0 before the poorly-designed cvtsi2sd merges into it. cvtsi2sd only writes the low element of XMM0 and keeps the rest, so it has a false dependency on XMM0's old value. GCC breaks false deps religiously. Clang is pretty cavalier unless a loop-carried dep chain would exist inside a single function. So clang's code can be slowed down by interactions between separate non-inlined functions that might happen to get called in a loop. See Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster? for an example where this bites clang (but GCC is fine).
That answer also links some GCC missed-optimization bug reports where I wrote more details about the idea of reusing a "cold" register to avoid false dependencies in conversions and stuff like [v]sqrtsd, which is also a 1-input operation.
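Without AVX, the pxor-before-cvtsi2sd idea can be spelled out with SSE2 intrinsics. This is a sketch assuming an x86-64 target, since _mm_cvtsi64_sd is only available in 64-bit mode. The helper name u32_to_double_zeroed is hypothetical. The intrinsics only express the intent: the compiler still picks the registers and schedules the instructions, so check the generated asm if the false dependency matters.

```c
#include <emmintrin.h>   /* SSE2 intrinsics; _mm_cvtsi64_sd requires x86-64 */
#include <stdint.h>

/* u32 -> double with an explicitly zeroed merge destination.
   cvtsi2sd writes only the low element and keeps the upper one, so it
   depends on the destination's old value; a fresh zero breaks that dep. */
double u32_to_double_zeroed(uint32_t x) {
    __m128d zero = _mm_setzero_pd();                 /* like pxor xmm0, xmm0 */
    __m128d r = _mm_cvtsi64_sd(zero, (long long)x);  /* zero-extend, then cvtsi2sd xmm0, rax */
    return _mm_cvtsd_f64(r);                         /* take the low element */
}
```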
32-bit mode:
Different compilers have different strategies. gcc -O3 -m32 -mfpmath=sse -msseregparm is a good way to see what GCC does. -msseregparm makes it return in XMM0 instead of ST0, so it only uses x87 when that's actually more convenient (e.g. for 64-bit -> FP using fild).
I put some u32 and u64 -> float or double test functions on Godbolt with gcc and clang, but this answer is mostly aiming to answer the x86-64 part of the question which other answers didn't cover well, not obsolete 32-bit codegen. So I'm not going to copy the code and asm here and dissect it.
I will mention that double can exactly represent every u32, which allows a simple (double)(int)(u32 - 2^31) + double(2^31) trick to range-shift for signed conversion. But u32 -> float isn't so easy.
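Below is a C sketch of the range-shift trick, plus the same trick applied to float to show one way it goes wrong. The function names are hypothetical. The sketch assumes round-to-nearest and FLT_EVAL_METHOD == 0 (SSE math, as on x86-64); x87 excess precision could change the float result. With double, both the signed conversion and the add are exact. With float, the int -> float conversion rounds and then the add rounds again, and that double rounding can miss the correctly-rounded result.

```c
#include <stdint.h>

/* Range-shift u32 -> double: map [0, 2^32) to [-2^31, 2^31), convert as
   signed int32, then add 2^31 back. Exact: double's 53-bit significand
   holds every u32 and every intermediate value. */
double u32_to_double_shift(uint32_t u) {
    /* Well-defined C spelling of (int)(u - 2^31); in asm it's one
       xor/sub of 0x80000000 on a 32-bit register. */
    int32_t s = (int32_t)((int64_t)u - 2147483648LL);
    return (double)s + 2147483648.0;
}

/* Same trick with float: NOT correctly rounded in general. float has a
   24-bit significand, so the conversion rounds once and the add again. */
float u32_to_float_shift_naive(uint32_t u) {
    int32_t s = (int32_t)((int64_t)u - 2147483648LL);
    float f = (float)s;          /* first rounding */
    return f + 2147483648.0f;    /* second rounding */
}
```

For example, u = 3221225601 (2^31 + 2^30 + 129) should round to 3221225728.0f, which is what a direct (float)u gives. The naive float version first rounds the shifted value 2^30 + 129 down to 2^30 + 128. That turns the final add into an exact tie, and round-to-nearest-even then picks 3221225472.0f.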