Practical use of automatic vectorization?

https://stackoverflow.com/questions/409329

03-07-2019

Question

Has anyone taken advantage of the automatic vectorization that gcc can do? In the real world (as opposed to example code)? Does it take restructuring of existing code to take advantage? Are there a significant number of cases in any production code that can be vectorized this way?

Solution

I have yet to see either GCC or Intel C++ automatically vectorize anything but very simple loops, even when given code for algorithms that can be vectorized (and were, after I manually rewrote them using SSE intrinsics).

Part of this is conservatism: especially when faced with possible pointer aliasing, it can be very difficult for a C/C++ compiler to 'prove' to itself that a vectorization would be safe, even if you as the programmer know that it is. Most compilers (sensibly) prefer not to optimize code rather than risk miscompiling it. This is one area where higher-level languages have a real advantage over C, at least in theory (I say in theory since I'm not actually aware of any automatically vectorizing ML or Haskell compilers).
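As a rough illustration of the aliasing point, here is a minimal sketch in C99. The function names are made up for this example. In the first version the compiler has to allow for dst overlapping a or b; in the second, the restrict qualifier is the programmer's promise that the arrays do not overlap, which removes one obstacle to vectorization. Whether a particular compiler then actually vectorizes the loop still depends on its version and flags.

```c
#include <stddef.h>

/* Hypothetical example. Without restrict, the compiler must assume that
   dst may overlap a or b, so a store to dst[i] could change a later
   a[j] or b[j]. It may skip vectorization or guard it with a runtime
   overlap check. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* With restrict (C99), the caller promises that the three arrays do not
   overlap. Breaking that promise is undefined behaviour, so only use it
   when you really know the buffers are distinct. */
void add_arrays_restrict(float *restrict dst, const float *restrict a,
                         const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Both functions compute the same result when the arrays are distinct; the only difference is how much the compiler is allowed to assume.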

Another part of it is simply analytical limitations. Most research in vectorization, as I understand it, is aimed at optimizing classical numerical problems (fluid dynamics, say), which were the bread and butter of most vector machines until a few years ago (when, between CUDA/OpenCL, AltiVec/SSE, and the STI Cell, vector programming in various forms became widely available in commercial systems).

It's fairly unlikely that code written with a scalar processor in mind will be easy for a compiler to vectorize. Happily, many of the things you can do to make it easier for a compiler to see how to vectorize code, like loop tiling and partial loop unrolling, also (tend to) help performance on modern processors even if the compiler doesn't figure out how to vectorize it.
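To make the partial unrolling idea concrete, here is a small sketch applied to a sum over floats. The function name and the unroll factor of 4 are assumptions chosen for illustration. A plain running sum forms one long dependency chain, and a compiler is generally not allowed to reorder floating-point additions unless told it may. Splitting the sum into independent accumulators makes the parallelism explicit, which tends to help a superscalar CPU even when the loop is not vectorized.

```c
#include <stddef.h>

/* Hypothetical example: sum an array using four independent partial
   sums. Note that the additions happen in a different order than in a
   simple left-to-right loop, so the rounding of the result can differ
   slightly for general inputs. */
float sum_unrolled(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    /* Combine the partial sums. */
    float sum = (s0 + s1) + (s2 + s3);

    /* Tail loop: the remaining 0 to 3 elements. */
    for (; i < n; i++)
        sum += x[i];

    return sum;
}
```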

OTHER TIPS

It is hard to use in business logic, but it gives speedups when you are processing large volumes of data in the same way.

A good example is sound/video processing, where you apply the same operation to every sample/pixel. I have used VisualDSP for this, and you had to check the results after compiling to see whether vectorization was really applied where it should have been.
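A loop of the "same operation on every sample" kind might look like the sketch below. The function name is hypothetical, and the comment assumes GCC rather than VisualDSP: GCC's -fopt-info-vec options print which loops were vectorized and which were not, which is one way to do the check described above without reading the disassembly.

```c
#include <stddef.h>

/* Hypothetical example: scale every audio sample by the same gain.
 *
 * Assumption: compiling with GCC. Something like
 *
 *     gcc -O3 -fopt-info-vec-optimized -fopt-info-vec-missed -c gain.c
 *
 * asks the compiler to report the loops it vectorized and the ones it
 * gave up on (gain.c is a placeholder file name).
 */
void apply_gain(float *samples, size_t n, float gain)
{
    for (size_t i = 0; i < n; i++)
        samples[i] *= gain;
}
```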

Vectorization is primarily useful for numerical programs. Vectorized programs can run faster on vector processors like the STI Cell processor used in PS3 gaming consoles. There, the numerical computations used in, for example, rendering the game graphics can be sped up a lot by vectorization. Such processors are called SIMD (Single Instruction, Multiple Data) processors.

On other processors vectorization won't be used. Vectorized programs run on a vector instruction set, which isn't applicable to a non-SIMD processor.

Intel's Nehalem series of processors (released late 2008) implements the SSE4.2 instructions, which are SIMD instructions. Source: Wikipedia.

Vector instructions are not limited to Cell processors: most modern workstation-class CPUs have them (PPC, x86 since the Pentium III, SPARC, etc.). When used well for floating-point operations, they can help quite a lot with very compute-intensive tasks (filters, etc.). In my experience, automatic vectorization does not work so well.

You may have noticed that pretty much no one actually knows how to make good use of GCC's automatic vectorization. If you search the web for people's comments, it always comes down to the same thing: GCC allows you to enable automatic vectorization, but it extremely rarely makes actual use of it, so if you want SIMD acceleration (e.g. MMX, SSE, AVX, NEON, AltiVec), you basically have to figure out how to write it yourself using compiler intrinsics or assembly language code.

But the problem with intrinsics is that you effectively need to understand the assembly language side of it and then also learn the intrinsics way of describing what you want. The result is likely to be much less efficient than if you had written it in assembly code (such as by a factor of 10), because the compiler is still going to have trouble making good use of your intrinsics!

For example, you might be using SIMD intrinsics so that many operations can be performed in parallel, but your compiler will probably generate assembly code that transfers the data between the SIMD registers and the normal CPU registers and back, effectively making your SIMD code run at a similar speed to (or even slower than) normal code!
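For readers who have not seen intrinsics, here is a minimal sketch of SSE intrinsics in C. The function name is made up, and the scalar fallback for builds without SSE is an addition for portability. Each group of four floats is loaded, multiplied, added and stored as a vector, so nothing in the source asks for a round trip through the normal CPU registers. Whether the generated code really stays in the SIMD registers is exactly the kind of thing you would have to confirm in the disassembly.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, ... */
#endif

/* Hypothetical example: dst[i] = a[i] * k + b[i] for i in [0, n). */
void scale_add(float *dst, const float *a, const float *b, float k, size_t n)
{
    size_t i = 0;

#if defined(__SSE__)
    /* Broadcast k into all four lanes once, outside the loop. */
    const __m128 vk = _mm_set1_ps(k);

    /* Process four floats per iteration. The unaligned load/store
       variants are used so the arrays need no special alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vr = _mm_add_ps(_mm_mul_ps(va, vk), vb);
        _mm_storeu_ps(dst + i, vr);
    }
#endif

    /* Scalar loop: handles the leftover elements, or the whole array
       when SSE is not available at compile time. */
    for (; i < n; i++)
        dst[i] = a[i] * k + b[i];
}
```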

So basically:

If you want speedups of up to 100% (2x speed), then either buy the official Intel/ARM compilers or convert some of your code to use SIMD C/C++ intrinsics.
If you want speedups of around 10x, then write it by hand in assembly code using SIMD instructions. Or, if available on your hardware, use GPU acceleration instead, such as OpenCL or Nvidia's CUDA SDK, since they can provide speedups on the GPU similar to those SIMD provides on the CPU.

Licensed under: CC-BY-SA with attribution
