Question

I'm not exactly sure what happens when I call _mm_load_ps. I know it loads an array of 4 floats into a __m128, which I can use for SIMD-accelerated arithmetic before storing the result back, but isn't this __m128 data type still on the stack? Obviously there aren't enough registers for an arbitrary number of vectors to be loaded. So are these 128 bits of data moved back and forth each time you use a SIMD instruction to do a computation? If so, then what is the point of _mm_load_ps?

Maybe I have it all wrong?

Solution 2

An Intel processor with SSE, AVX, or AVX-512 has from 8 to 32 SIMD registers (see below). The number also depends on whether the code is 32-bit or 64-bit. So when you call _mm_load_ps, the values are loaded into a SIMD register. If all the registers are in use, some values will have to be spilled onto the stack.

This is exactly what happens when you have a lot of int or scalar float variables and the compiler can't keep all the currently "live" ones in registers. Load/store intrinsics mostly exist to tell the compiler about alignment, and as an alternative to pointer-casting onto other C data types. They don't have to compile to actual loads or stores, and they aren't the only way for compilers to emit vector load or store instructions.
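As a rough illustration of the alignment point, here is a minimal sketch. The function names add4_aligned and add4_unaligned are made up for this example; only the intrinsics themselves come from xmmintrin.h. Whether each intrinsic becomes a separate load instruction, gets folded into a memory operand of the add, or disappears entirely is up to the compiler.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_loadu_ps, ... */

/* Hypothetical helper: all three pointers are ASSUMED to be 16-byte
   aligned, which is what _mm_load_ps and _mm_store_ps require. */
void add4_aligned(const float *a, const float *b, float *out)
{
    __m128 va  = _mm_load_ps(a);      /* promises the compiler 16-byte alignment */
    __m128 vb  = _mm_load_ps(b);
    __m128 sum = _mm_add_ps(va, vb);  /* four float additions at once */
    _mm_store_ps(out, sum);           /* aligned store */
}

/* Hypothetical helper: same arithmetic, but makes no alignment promise. */
void add4_unaligned(const float *a, const float *b, float *out)
{
    __m128 va  = _mm_loadu_ps(a);     /* "u" = unaligned is allowed */
    __m128 vb  = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

In both functions va, vb and sum are ordinary local variables. With optimization enabled a compiler will normally keep them in XMM registers rather than on the stack, but nothing in the source forces that.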


Processor with SSE

8  128-bit registers labeled XMM0 - XMM7  //32-bit operating mode
16 128-bit registers labeled XMM0 - XMM15 //64-bit operating mode

Processor with AVX/AVX2

8  256-bit registers labeled YMM0 - YMM7  //32-bit operating mode
16 256-bit registers labeled YMM0 - YMM15 //64-bit operating mode

Processor with AVX-512 (2015/2016 servers, Ice Lake laptop, ?? desktop)

8  512-bit registers labeled ZMM0 - ZMM7  //32-bit operating mode
32 512-bit registers labeled ZMM0 - ZMM31 //64-bit operating mode

Wikipedia has a good summary of AVX-512.

(Of course, the compiler can only use x/y/zmm16..31 if you tell it it's allowed to use AVX-512 instructions. Having an AVX-512-capable CPU does you no good when running machine code compiled to work on CPUs with only AVX2.)
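To make the "what the compiler is allowed to use" point concrete, here is a small sketch that reports the register file the current build targets. It relies on the predefined macros __SSE__, __AVX__, __AVX512F__ and __x86_64__ as GCC and Clang define them (other compilers may spell these differently), and the function names are invented for this example. It reflects compile-time flags such as -mavx2 or -mavx512f, not what the CPU it runs on supports.

```c
/* Hypothetical helper: how many architectural SIMD registers the
   compiler may allocate in this build, per the table above. */
int simd_register_count(void)
{
#if !defined(__SSE__)
    return 0;            /* no SSE enabled for this build (or not x86) */
#elif defined(__x86_64__)
  #if defined(__AVX512F__)
    return 32;           /* x/y/zmm0..31 */
  #else
    return 16;           /* xmm/ymm0..15 */
  #endif
#else
    return 8;            /* 32-bit mode: registers 0..7 only */
#endif
}

/* Hypothetical helper: widest vector register enabled for this build. */
int simd_register_width_bits(void)
{
#if defined(__AVX512F__)
    return 512;          /* ZMM */
#elif defined(__AVX__)
    return 256;          /* YMM */
#elif defined(__SSE__)
    return 128;          /* XMM */
#else
    return 0;
#endif
}
```

Building the same file with and without -mavx512f on a 64-bit target should change the first result from 16 to 32, regardless of the machine doing the compiling.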

OTHER TIPS

In just the same way that an int variable may reside in a register or in memory (or even both, at different times), the same is true of an SSE variable such as __m128. If there are sufficient free XMM registers then the compiler will normally try to keep the variable in a register (unless you do something unhelpful, like take the address of the variable), but if there is too much register pressure then some variables may spill to memory.
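A short sketch of what this looks like in practice: the accumulator below is a plain __m128 local whose address is never taken, so an optimizing compiler would typically keep it in one XMM register for the entire loop. The function name sum_floats is invented for this example, and the register allocation described is the usual outcome, not a guarantee.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: sums n floats, four lanes at a time. */
float sum_floats(const float *data, size_t n)
{
    __m128 acc = _mm_setzero_ps();   /* address never taken: a good register candidate */
    size_t i = 0;

    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));

    /* Only here does the vector explicitly go to memory, so the
       four partial sums can be added as scalars. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    for (; i < n; ++i)               /* leftover elements when n is not a multiple of 4 */
        total += data[i];
    return total;
}
```

If the loop body used more live vector variables than there are registers available, the compiler would spill some of them to the stack and reload them as needed, exactly as it does for int variables.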

Licensed under: CC-BY-SA with attribution