An Intel processor with SSE, AVX, or AVX-512 has between 8 and 32 SIMD registers (see below). The number also depends on whether the code is 32-bit or 64-bit. So when you call _mm_load_ps, the values are loaded into a SIMD register. If all the registers are in use, some will have to be spilled to the stack.
This is exactly like having a lot of int or scalar float variables when the compiler can't keep all of the currently "live" ones in registers. Load/store intrinsics mostly just exist to tell the compiler about alignment, and as an alternative to pointer-casting onto other C data types. They don't have to compile to actual loads or stores, and they aren't the only way for compilers to emit vector load or store instructions.
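As a rough illustration of the point above, here is a minimal sketch of a loop that uses load intrinsics. The function name sum_floats and the requirement that n is a multiple of 4 are assumptions made for this example, not anything from the question. Nothing in the source says which register each variable ends up in; that is the compiler's decision, just as with scalar variables.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: sums n floats, where n must be a multiple of 4. */
float sum_floats(const float *data, size_t n)
{
    assert(n % 4 == 0);

    /* acc is an ordinary C variable of type __m128. The compiler will
       normally keep it in an XMM register, but it is free to spill it
       to the stack if it runs out of registers. */
    __m128 acc = _mm_setzero_ps();

    for (size_t i = 0; i < n; i += 4) {
        /* _mm_loadu_ps makes no alignment promise. _mm_load_ps would
           instead promise the compiler that the address is 16-byte
           aligned. Either way, the compiler decides what instruction
           (if any separate one) to emit for it. */
        __m128 v = _mm_loadu_ps(data + i);
        acc = _mm_add_ps(acc, v);
    }

    /* Store the four lanes to memory and finish the sum in scalar code. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Looking at the compiler's assembly output for a function like this (for example with -O2 -S) is the easiest way to see which registers were actually used and whether anything was spilled.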
Processor with SSE
8 128-bit registers labeled XMM0 - XMM7 //32-bit operating mode
16 128-bit registers labeled XMM0 - XMM15 //64-bit operating mode
Processor with AVX/AVX2
8 256-bit registers labeled YMM0 - YMM7 //32-bit operating mode
16 256-bit registers labeled YMM0 - YMM15 //64-bit operating mode
Processor with AVX-512 (2015/2016 servers, Ice Lake laptop, ?? desktop)
8 512-bit registers labeled ZMM0 - ZMM7 //32-bit operating mode
32 512-bit registers labeled ZMM0 - ZMM31 //64-bit operating mode
Wikipedia's AVX-512 article has a good summary of this.
(Of course, the compiler can only use x/y/zmm16..31 if you tell it it's allowed to use AVX-512 instructions. Having an AVX-512-capable CPU does you no good when running machine code compiled to work on CPUs with only AVX2.)
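To make that concrete, here is a small sketch that reports how many vector registers the compiler is allowed to use for the current build, based on predefined macros. The function name vector_regs_usable is made up for this example, and the code assumes an x86 target. GCC and Clang define __AVX512F__ when built with -mavx512f (or a -march= that implies it); MSVC defines it with /arch:AVX512.

```c
/* Hypothetical helper: number of SIMD registers the compiler may
   allocate in this build. Assumes the target is x86 or x86-64.
   This reflects compile-time options, not what the CPU running
   the program supports. */
int vector_regs_usable(void)
{
#if defined(__x86_64__) || defined(_M_X64)
  #if defined(__AVX512F__)
    return 32;  /* x/y/zmm0..31 available in 64-bit mode with AVX-512 */
  #else
    return 16;  /* xmm/ymm0..15 only */
  #endif
#else
    return 8;   /* 32-bit mode: registers 0..7 only, whatever the ISA extension */
#endif
}
```

Building the same file with and without -mavx512f on a 64-bit target should change the result from 16 to 32, regardless of which CPU the compiler itself is running on.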