Problem:

I converted MMX code to the corresponding SSE2 code and expected roughly a 1.5x-2x speedup. But both took exactly the same time. Why is that?

Scenario:

I am learning the SIMD instruction sets and comparing their performance. I took an array operation Z = X^2 + Y^2, where X and Y are large one-dimensional arrays of type char. The values of X and Y are restricted to be less than 10, so that Z is always < 255 (1 byte) and there is no need to worry about overflow.
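
For reference, a minimal scalar C++ version of the operation might look like this (the function name and signature are my own illustration, not code from the gist):

    #include <cstddef>

    // Plain C++ baseline: Z[i] = X[i]^2 + Y[i]^2 for each element.
    // Since X[i], Y[i] < 10, each square is < 100 and the sum is at most
    // 162, so the result always fits in a single unsigned byte.
    void square_sum_scalar(const unsigned char* X, const unsigned char* Y,
                           unsigned char* Z, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            Z[i] = (unsigned char)(X[i] * X[i] + Y[i] * Y[i]);
    }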

I wrote the C++ code first and checked its time. Then I wrote the corresponding assembly code (~3x speedup over C++). Then I wrote its MMX code (~12x vs. C++). Then I converted the MMX code into SSE2 code, and it runs at exactly the same speed as the MMX code. Theoretically, I expected SSE2 to be ~2x faster than MMX.

For the conversion from MMX to SSE2, I converted all the mmx registers to xmm registers, changed a couple of move instructions, and so on.
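
For readers who don't open the gist, the SSE2 inner loop for this operation typically looks like the following intrinsics sketch (my reconstruction under the stated assumptions, not the exact gist code; it assumes n is a multiple of 16 and 16-byte-aligned pointers):

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstddef>

    // Z[i] = X[i]*X[i] + Y[i]*Y[i], 16 elements per iteration.
    void square_sum_sse2(const unsigned char* X, const unsigned char* Y,
                         unsigned char* Z, size_t n)
    {
        const __m128i zero = _mm_setzero_si128();
        for (size_t i = 0; i < n; i += 16) {
            __m128i x = _mm_load_si128((const __m128i*)(X + i));
            __m128i y = _mm_load_si128((const __m128i*)(Y + i));

            // Widen bytes to 16-bit lanes so the squares fit.
            __m128i xlo = _mm_unpacklo_epi8(x, zero);
            __m128i xhi = _mm_unpackhi_epi8(x, zero);
            __m128i ylo = _mm_unpacklo_epi8(y, zero);
            __m128i yhi = _mm_unpackhi_epi8(y, zero);

            // x^2 + y^2 per 16-bit lane (at most 81 + 81 = 162).
            __m128i zlo = _mm_add_epi16(_mm_mullo_epi16(xlo, xlo),
                                        _mm_mullo_epi16(ylo, ylo));
            __m128i zhi = _mm_add_epi16(_mm_mullo_epi16(xhi, xhi),
                                        _mm_mullo_epi16(yhi, yhi));

            // Narrow back to bytes with unsigned saturation and store.
            _mm_store_si128((__m128i*)(Z + i), _mm_packus_epi16(zlo, zhi));
        }
    }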

My MMX and SSE2 code is pasted here: https://gist.github.com/abidrahmank/5281486 (I don't want to paste it all here).

These functions are later called from main.cpp, where the arrays are passed as arguments.

What I have done:

1 - I went through some optimization manuals from Intel and other websites. The main pitfall with SSE2 code is 16-byte memory alignment. When I manually checked the addresses, they were all found to be 16-byte aligned. I also tried both MOVDQU and MOVDQA, but both give the same result and no speedup compared to MMX. (One way to guarantee alignment is shown in the sketch after this list.)

2 - I stepped through it in debug mode and checked the register values as the instructions executed. They are executed exactly as I intended, i.e. 16 bytes are loaded and the resulting 16 bytes are written out.
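
Regarding point 1: on MSVC, one way to guarantee the 16-byte alignment that MOVDQA requires is _aligned_malloc. A small sketch (the helper name alloc16 is mine):

    #include <malloc.h>    // _aligned_malloc / _aligned_free on MSVC
    #include <cassert>
    #include <cstdint>
    #include <cstddef>

    // Returns an n-byte buffer whose address is a multiple of 16,
    // as required by MOVDQA / _mm_load_si128.
    unsigned char* alloc16(size_t n)
    {
        unsigned char* p = (unsigned char*)_aligned_malloc(n, 16);
        assert(((uintptr_t)p & 0xF) == 0);   // low 4 address bits must be zero
        return p;                            // free later with _aligned_free(p)
    }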

Resources:

I am using an Intel Core i5 processor with Windows 7 and Visual C++ 2010.

Question:

So the final question is: why is there no performance improvement for the SSE2 code compared to the MMX code? Am I doing anything wrong in the SSE2 code? Or is there another explanation?

Solution

Harold's comment was absolutely correct: the arrays you are processing do not fit into the cache on your machine, so your computation is entirely load/store bound.

I timed the throughput of your computation on a current-generation i7 for various buffer lengths, and also the throughput of the same routine with everything except for the loads and stores removed:

[Plot: throughput vs. buffer length, for the full computation and for the loads/stores alone]

What we observe here is that once the buffer gets so big that it is out of the L3 cache, the throughput of your computation exactly matches the achieved load/store bandwidth. This tells us that how you process the data makes essentially no difference (unless you make it significantly slower); the speed of computation is limited by the ability of the processor to move data to/from memory.
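
To put rough numbers on it: producing each output byte requires loading two input bytes and storing one, i.e. about 3 bytes of memory traffic per element. If main memory sustains on the order of 10-20 GB/s, that caps the kernel at a few billion elements per second, regardless of whether the ALU consumes 8 bytes per instruction (MMX) or 16 (SSE2).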

If you do your timing on smaller arrays, you will see a difference between your two implementations.
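
For example, a cache-resident timing on the asker's platform might look like this (a sketch, assuming the hypothetical square_sum_sse2 routine from earlier; QueryPerformanceCounter is the standard high-resolution timer on Windows):

    #include <windows.h>   // QueryPerformanceCounter / QueryPerformanceFrequency
    #include <cstddef>

    // Time many repetitions over a buffer small enough to stay in L1/L2,
    // so the kernel is compute-bound instead of memory-bound.
    double time_kernel(const unsigned char* X, const unsigned char* Y,
                       unsigned char* Z, size_t n, int reps)
    {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);
        for (int r = 0; r < reps; ++r)
            square_sum_sse2(X, Y, Z, n);   // e.g. n = 16 KB, reps = 100000
        QueryPerformanceCounter(&t1);
        return (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
    }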

License: CC-BY-SA with attribution