How much speed-up from converting 3D maths to SSE or other SIMD?

https://stackoverflow.com/questions/115291

02-07-2019
|

Question

I am using 3D maths in my application extensively. How much speed-up can I achieve by converting my vector/matrix library to SSE, AltiVec or a similar SIMD code?

Solution

In my experience I typically see about a 3x improvement in taking an algorithm from x87 to SSE, and a better than 5x improvement in going to VMX/Altivec (because of complicated issues having to do with pipeline depth, scheduling, etc). But I usually only do this in cases where I have hundreds or thousands of numbers to operate on, not for those where I'm doing one vector at a time ad hoc.

OTHER TIPS

That's not the whole story, but it's possible to get further optimizations using SIMD, have a look at Miguel's presentation about when he implemented SIMD instructions with MONO which he held at PDC 2008,

_{(source: tirania.org)}

Picture from Miguel's blog entry.

For some very rough numbers: I've heard some people on ompf.org claim 10x speed ups for some hand-optimized ray tracing routines. I've also had some good speed ups. I estimate I got somewhere between 2x and 6x on my routines depending on the problem, and many of these had a couple of unnecessary stores and loads. If you have a huge amount of branching in your code, forget about it, but for problems that are naturally data-parallel you can do quite well.

However, I should add that your algorithms should be designed for data-parallel execution. This means that if you have a generic math library as you've mentioned then it should take packed vectors rather than individual vectors or you'll just be wasting your time.

E.g. Something like

namespace SIMD {
class PackedVec4d
{
  __m128 x;
  __m128 y;
  __m128 z;
  __m128 w;

  //...
};
}

Most problems where performance matters can be parallelized since you'll most likely be working with a large dataset. Your problem sounds like a case of premature optimization to me.

For 3D operations beware of un-initialized data in your W component. I've seen cases where SSE ops (_mm_add_ps) would take 10x normal time because of bad data in W.

The answer highly depends on what the library is doing and how it is used.

The gains can go from a few percent points, to "several times faster", the areas most susceptible of seeing gains are those where you're not dealing with isolated vectors or values, but multiple vectors or values that have to be processed in the same way.

Another area is when you're hitting cache or memory limits, which, again, requires a lot of values/vectors being processed.

The domains where gains can be the most drastic are probably those of image and signal processing, computational simulations, as well general 3D maths operation on meshes (rather than isolated vectors).

These days all the good compilers for x86 generate SSE instructions for SP and DP float math by default. It's nearly always faster to use these instructions than the native ones, even for scalar operations, so long as you schedule them correctly. This will come as a surprise to many, who in the past found SSE to be "slow", and thought compilers could not generate fast SSE scalar instructions. But now, you have to use a switch to turn off SSE generation and use x87. Note that x87 is effectively deprecated at this point and may be removed from future processors entirely. The one down point of this is we may lose the ability to do 80bit DP float in register. But the consensus seems to be if you are depending on 80bit instead of 64bit DP floats for the precision, your should look for a more precision loss-tolerant algorithm.

Everything above came as a complete surprise to me. It's very counter intuitive. But data talks.

Most likely you will see only very small speedup, if any, and the process will be more complicated than expected. For more details see The Ubiquitous SSE vector class article by Fabian Giesen.

The Ubiquitous SSE vector class: Debunking a common myth

Not that important

First and foremost, your vector class is probably not as important for the performance of your program as you think (and if it is, it's more likely because you're doing something wrong than because the computations are inefficient). Don't get me wrong, it's probably going to be one of the most frequently used classes in your whole program, at least when doing 3D graphics. But just because vector operations will be common doesn't automatically mean that they'll dominate the execution time of your program.

Not so hot

Not easy

Not now

Not ever

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow