Question

I've been trying to figure out how to squeeze some extra performance out of a crucial couple of lines in my code:

float x = a*b;
float y = c*d;
float z = e*f;
float w = g*h;

all a, b, c... are floats.

I decided to look into using SSE, but can't seem to find any improvement, in fact it turns out to be twice as slow. My SSE code is:

Vector4 abcd, efgh, result;
abcd = [float a, float b, float c, float d];
efgh = [float e, float f, float g, float h];
_asm {
movups xmm1, abcd
movups xmm2, efgh
mulps xmm1, xmm2
movups result, xmm1
}

I also attempted using standard inline assembly, but it doesn't appear that I can pack the register with the four floating points like I can with SSE.

Any comments or help would be greatly appreciated. I mainly need to understand why my calculations using SSE are slower than the serial C++ code.

I'm compiling in Visual Studio 2005 on Windows XP, using a Pentium 4 with HT, if that provides any additional information to assist.

Thanks in advance!


Solution

As you've found out, just replacing a couple of instructions with SSE is not going to work, because you need to shuffle the data around in memory in order to load the SSE registers correctly. That moving of data around in memory (the bit that constructs the arrays) is going to kill your performance, because memory is slow (hard disk aside, memory is invariably the bottleneck these days).

Also, there is no way to move data between the SSE registers and the FPU/ALU without using a write to RAM followed by a read. Modern IA32 chips cope well with this particular pattern (write then read) but it will still invalidate some cache, which will have a knock-on effect.

To get the best out of SSE you need to look at the whole algorithm and the data the algorithm uses. The values of a, b, c and d and e, f, g and h need to live permanently in those arrays so that there is no shifting of data around in memory prior to loading the SSE registers. It is not straightforward and may require a lot of reworking of your code and data (you may need to store the data differently on disk).
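To illustrate the point about keeping data in SSE-friendly arrays, here is a minimal sketch (with hypothetical names) that multiplies two whole arrays element-wise, so the data never needs rearranging before the loads:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Element-wise multiply of two float arrays, four at a time.
// Assumes n is a multiple of 4 and all pointers are 16-byte aligned,
// so the fast aligned load/store (movaps) forms can be used.
void mul_arrays(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);           // aligned load of a[i..i+3]
        __m128 vb = _mm_load_ps(b + i);           // aligned load of b[i..i+3]
        _mm_store_ps(out + i, _mm_mul_ps(va, vb)); // mulps + aligned store
    }
}
```

Because the inputs already sit contiguously in memory, the loop is just loads, one multiply and a store per four elements, with no packing cost.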

It might also be worth pointing out that SSE is only 32-bit (or 64-bit if you use doubles) whereas the FPU is 80-bit (regardless of float or double), so you will get slightly different results when using SSE compared to using the FPU. Only you know if this will be an issue.

OTHER TIPS

You are using unaligned instructions, which are very slow. You may want to try aligning your data correctly, on a 16-byte boundary, and using movaps. A better alternative is to use intrinsics rather than assembly, because then the compiler is free to order instructions as it sees fit.
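Put together, the aligned-data-plus-intrinsics version of your snippet might look like the sketch below (alignas is C++11; on VS2005 the equivalent is __declspec(align(16))):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Intrinsic equivalent of the movaps/mulps/movaps sequence.
// All three arrays must be 16-byte aligned.
void mul4_aligned(const float abcd[4], const float efgh[4], float result[4]) {
    __m128 v1 = _mm_load_ps(abcd);             // aligned load (movaps)
    __m128 v2 = _mm_load_ps(efgh);
    _mm_store_ps(result, _mm_mul_ps(v1, v2));  // mulps, then aligned store
}
```

With intrinsics the compiler keeps the values in registers where it can and schedules the instructions around the rest of your code, which inline _asm blocks prevent.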

You can enable the use of SSE and SSE2 in the project options in newer VS versions, and possibly in 2005. Are you compiling with an Express edition?

Also, your SSE code is probably slower because when you compile serial C++, the compiler is smart and does a very good job of making it fast: for example, automatically putting values in the right registers at the right time. If the operations occur in serial, the compiler can reduce the impact of caching and paging. Inline assembler, however, is optimized poorly at best and should be avoided whenever possible.

In addition, you'd have to be performing a HUGE amount of work for SSE/2 to bring a notable benefit.

This is an old thread, but I noticed a mistake in your example. If you want to perform this:

float x = a*b;
float y = c*d;
float z = e*f;
float w = g*h;

Then the code should be like this:

Vector4 aceg, bdfh, result;  // xyzw
aceg = [float a, float c, float e, float g];
bdfh = [float b, float d, float f, float h];
_asm {
movups xmm1, aceg
movups xmm2, bdfh
mulps xmm1, xmm2
movups result, xmm1
}

And to gain even some more speed, I'd suggest that you don't use a separate register for "result".

For starters, not all algorithms will benefit from being rewritten in SSE. Data-driven algorithms (like algorithms driven by look-up tables) don't translate well into SSE because a lot of time is lost packing and unpacking data into vectors for SSE to operate on.

Hope this still helps.

Firstly, when you have something 128-bit (16-byte) aligned, you should use MOVAPS, as it can be much faster. The compiler should usually give you 16-byte alignment, even on 32-bit systems.

Your C/C++ lines don't do the same thing as your sse code.

The four floats in one xmm register are multiplied by the four floats in the other register. Giving you:

float x = a*e;
float y = b*f;
float z = c*g;
float w = d*h;

In SSE1 you have to use SHUFPS to reorder the floats in both registers before multiplying.
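That SHUFPS reordering can be sketched with intrinsics, assuming the inputs arrive packed as (a,b,c,d) and (e,f,g,h):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Given v1 = (a,b,c,d) and v2 = (e,f,g,h), compute (a*b, c*d, e*f, g*h).
__m128 mul_adjacent(__m128 v1, __m128 v2) {
    // Gather the even lanes (a,c,e,g) and odd lanes (b,d,f,h) with shufps,
    // then multiply the pairs in one mulps.
    __m128 even = _mm_shuffle_ps(v1, v2, _MM_SHUFFLE(2, 0, 2, 0));
    __m128 odd  = _mm_shuffle_ps(v1, v2, _MM_SHUFFLE(3, 1, 3, 1));
    return _mm_mul_ps(even, odd);
}
```

Note the two extra shuffles are exactly the kind of packing overhead discussed above; if you can store the data pre-separated into even/odd arrays, they disappear.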

Also, for processing data that is bigger than the CPU cache, you can use non-temporal stores (MOVNTPS) to reduce cache pollution. Note that non-temporal stores are a lot slower in other cases.
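A minimal sketch of the non-temporal variant (hypothetical function name; only worthwhile when the output array is much larger than the cache):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Element-wise multiply writing results with non-temporal stores
// (movntps), so the output doesn't evict useful cache lines.
// Assumes n is a multiple of 4 and all pointers are 16-byte aligned.
void mul_arrays_stream(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 p = _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i));
        _mm_stream_ps(out + i, p);  // non-temporal store, bypasses cache
    }
    _mm_sfence();  // make the streamed stores globally visible
}
```

The trailing sfence matters: streamed stores are weakly ordered, so you fence before anything else reads the output.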

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow