Question

I'm a student learning the x86 and ARM architectures.

I was wondering how many cycles it takes to put multiple pieces of data into SIMD registers.

I understand that x86 SSE's XMM registers are 128 bits wide.

If I want to put 32 8-bit values into one of the XMM registers from the stack, using the SIMD instruction set in assembly language,

does it take the same number of cycles as a general-purpose register PUSH/POP?

Or does it need 32x the time, once for each 8-bit piece of data?

Thank you for your time and attention!

The solution

Short answer:

If you're doing many repeated 128-bit loads, then it's possible to achieve two 128-bit loads per clock cycle on Sandy Bridge, Ivy Bridge, and Haswell, or one 128-bit load per clock cycle on Nehalem. For processors before Nehalem it depends on whether you do an aligned or an unaligned load.

Long answer:

Mysticial gave you the information you need in Agner Fog's Instruction Tables. But let me spell it out for you (and myself).

The instructions you want to look at are MOVDQU and MOVDQA with operands x, m128. Both load 128 bits of data into an XMM register in one operation. MOVDQA requires that the address be 16-byte aligned; MOVDQU has no such restriction.
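
To make that concrete, here is a minimal C sketch (my own, not from the original question or answer) that loads 16 bytes, i.e. one full XMM register, in a single operation. With most compilers _mm_load_si128 maps to MOVDQA and _mm_loadu_si128 to MOVDQU:

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* 16 bytes of 8-bit data; _Alignas(16) makes the aligned load legal */
        _Alignas(16) uint8_t data[16] =
            { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };

        __m128i a = _mm_load_si128((const __m128i *)data);   /* aligned load   -> MOVDQA */
        __m128i u = _mm_loadu_si128((const __m128i *)data);  /* unaligned load -> MOVDQU */

        /* use the registers so the loads are not optimized away */
        __m128i sum = _mm_add_epi8(a, u);
        uint8_t out[16];
        _mm_storeu_si128((__m128i *)out, sum);
        printf("%u\n", out[15]);   /* prints 30 */
        return 0;
    }

Compile with e.g. gcc -O2 file.c (SSE2 is baseline on x86-64, so no extra flag is normally needed there).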

The two metrics you want to look at are latency and reciprocal throughput (lower is better). Two important changes happened to these metrics with Nehalem and Sandy Bridge:

  1. Intel processors before Nehalem had a higher latency and reciprocal throughput for MOVDQU. However, since Nehalem, MOVDQU and MOVDQA have had identical latency and reciprocal throughput.

  2. All Intel processors since Sandy Bridge can do two 128-bit loads at the same time. This can be seen nicely at intels-haswell-architecture. In Nehalem only port 2 can do a 128-bit load, whereas Sandy Bridge, Ivy Bridge, and Haswell can do two 128-bit loads at the same time on ports 2 and 3 (which is how they do one AVX load). So the reciprocal throughput for Nehalem is 1, whereas for Sandy Bridge it's 0.5 (a small microbenchmark sketch follows this list).
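
If you want to see the load throughput yourself, here is a rough microbenchmark sketch (my own, assuming the buffer stays in cache). Each inner iteration issues two independent 128-bit loads, which Sandy Bridge and later can start in the same cycle; a real measurement needs more care (unrolling, a proper timer, a fixed clock frequency):

    #include <emmintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define N    (1 << 16)   /* 64 KiB buffer, fits in L2 */
    #define REPS 100000

    static _Alignas(16) uint8_t buf[N];

    int main(void)
    {
        __m128i acc0 = _mm_setzero_si128();
        __m128i acc1 = _mm_setzero_si128();

        clock_t t0 = clock();
        for (int r = 0; r < REPS; r++) {
            /* two independent 128-bit loads per iteration: on Sandy Bridge
               and later both can issue in the same cycle (ports 2 and 3) */
            for (int i = 0; i < N; i += 32) {
                __m128i v0 = _mm_load_si128((const __m128i *)(buf + i));
                __m128i v1 = _mm_load_si128((const __m128i *)(buf + i + 16));
                acc0 = _mm_add_epi8(acc0, v0);
                acc1 = _mm_add_epi8(acc1, v1);
            }
        }
        clock_t t1 = clock();

        /* print something derived from the accumulators so the loop survives optimization */
        uint8_t out[16];
        _mm_storeu_si128((__m128i *)out, _mm_add_epi8(acc0, acc1));
        printf("checksum %u, time %.2fs\n", out[0],
               (double)(t1 - t0) / CLOCKS_PER_SEC);
        return 0;
    }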

However, even though MOVDQA and MOVDQU have identical latency and reciprocal throughput on every processor since Nehalem, that does not mean they give identical performance. If the address is not 16-byte aligned then performance may drop. You can test this with the code by ScottD at Successful compilation of SSE instruction with qmake (but SSE2 is not recognized), where I got about a 4% drop. I think this is due to cases where a load crosses a cache line (e.g. the first 64 bits in one cache line and the next 64 bits in another); otherwise the performance is equal. This effectively means there is no reason to use MOVDQA anymore since Nehalem. The only remaining difference is the memory-alignment requirement.
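
The cache-line-crossing effect can be checked with a sketch along these lines (my own code, not ScottD's). The buffer is 64-byte aligned, so with a 16-byte stride starting at offset 1 every fourth MOVDQU load straddles a cache-line boundary; the difference is small and needs a careful setup to measure reliably:

    #include <emmintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* 64-byte aligned buffer so we control exactly where cache lines fall */
    static _Alignas(64) uint8_t buf[1 << 16];

    /* sum the buffer with MOVDQU loads starting at a given byte offset */
    static double run(int offset)
    {
        __m128i acc = _mm_setzero_si128();
        clock_t t0 = clock();
        for (int r = 0; r < 100000; r++)
            for (int i = offset; i + 16 <= (int)sizeof buf; i += 16)
                acc = _mm_add_epi8(acc, _mm_loadu_si128((const __m128i *)(buf + i)));
        clock_t t1 = clock();

        uint8_t out[16];
        _mm_storeu_si128((__m128i *)out, acc);   /* keep the loop from being optimized away */
        return (double)(t1 - t0) / CLOCKS_PER_SEC + out[0] * 0.0;
    }

    int main(void)
    {
        printf("offset 0 (aligned):                   %.2fs\n", run(0));
        printf("offset 1 (1 in 4 loads splits a line): %.2fs\n", run(1));
        return 0;
    }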


Edit: I said that Haswell can do two 128-bit loads at the same time. In fact, it can do two 256-bit loads at the same time.


Edit: It turns out that with SSE, unaligned load instructions cannot be folded with another operation. Folding allows the CPU to use micro-op fusion (it does not mean it will fuse, but without folding it certainly won't). So it's not entirely accurate to say that aligned load instructions are obsolete since Nehalem. It's more accurate to say they became obsolete with AVX (which arrived with Sandy Bridge for Intel). Though, in practice, not folding probably makes little difference except in some special cases.
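
For illustration, here is what that folding point looks like at the intrinsics level (my own sketch; the instruction sequences mentioned in the comments are typical compiler output, not something stated in the original answer):

    #include <emmintrin.h>

    /* With plain SSE, the aligned load can be folded into the memory operand
       of PADDB (e.g. "paddb xmm0, [rdi]"), because the SSE memory form
       requires a 16-byte-aligned address anyway. */
    __m128i add_aligned(const __m128i *p, __m128i x)
    {
        return _mm_add_epi8(x, _mm_load_si128(p));
    }

    /* The unaligned load cannot be folded under plain SSE, so compilers
       typically emit a separate MOVDQU followed by a register-register PADDB.
       With AVX (VEX encoding), memory operands no longer require alignment,
       so the fold becomes legal again (e.g. "vpaddb xmm0, xmm0, [rdi]"). */
    __m128i add_unaligned(const __m128i *p, __m128i x)
    {
        return _mm_add_epi8(x, _mm_loadu_si128(p));
    }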

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow