Question

In a helpful but somewhat dated November 2006 article on vectorizing code with vDSP, the author makes the statement:

Important to keep in mind is the fact that only operations with strides equal to one will deliver blazingly fast vectorized code.

Is this still true today? Even on the newer Intel processors with their more capable vector intrinsics?

I ask because I am in the process of writing some matrix math routines, and have just started down the path of switching them all to use Fortran-like column-major ordering in an effort to be more readily compatible with MATLAB, BLAS and LAPACK. But now I find some of my calls to vDSP need to work on vectors that are no longer contiguous…

At present these vDSP calls are the bottleneck routines that my code exercises. Not to say that this will always be the case, but for now at least I would hate to slow them down just to make calls to those other libraries simpler.

My most-frequently-called vDSP routine right now is vDSP_distancesq, in case that makes a difference.


Solution 2

Yes.

In cases where there is good reason, we (Apple’s Vector and Numerics Group) can add optimizations for certain other strides. For example, for a stride of two, on some processor models, we would load vector blocks as normal but then use various permute instructions to extract just every other element. This would result in code that is not as fast as unit-stride code but faster than the current code. This is rarely done, because other approaches are often better, such as copying strided data to a unit-stride buffer, performing several vDSP operations on the buffer, and copying the data back.
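The copy-to-buffer approach can be sketched as below. This is a minimal illustration in plain C: the scalar gather/scatter loops and the squaring step are stand-ins for whatever copy and compute routines you would actually call (vDSP itself offers strided copies and the real vectorized kernels), not the library's implementation.

```c
#include <stddef.h>

/* Gather a strided vector into a contiguous scratch buffer, run one or
 * more unit-stride operations on it, then scatter the results back.
 * Paying the two copies once can be worthwhile when several unit-stride
 * operations are performed on the buffer in between. */
static void process_strided(float *data, size_t n, size_t stride,
                            float *scratch /* capacity >= n */)
{
    for (size_t i = 0; i < n; i++)   /* gather: strided -> unit stride */
        scratch[i] = data[i * stride];

    for (size_t i = 0; i < n; i++)   /* unit-stride work (here: squaring) */
        scratch[i] = scratch[i] * scratch[i];

    for (size_t i = 0; i < n; i++)   /* scatter back to the strided layout */
        data[i * stride] = scratch[i];
}
```

The amortization argument is the key design point: one gather and one scatter bracket as many unit-stride operations as you can batch together.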

The case you describe does not seem like a good candidate for specializing for non-unit strides. If you are performing multiple vDSP_distancesq operations down consecutive columns of an array, it would be better to do them in parallel (multiple columns at once) instead of serial (all of one column, then all of another column,…). If you are doing only single vDSP_distancesq operations down isolated columns, there are other issues. Column operations on matrices are notorious for cache problems, especially if the number of bytes per row is a multiple of a sizable power of two. The operation might be bound by memory loads, so writing specialized code to optimize the calculations might have no gain.
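The "multiple columns at once" idea can be sketched as follows, for the case where columns are the strided direction (row-major storage). Walking row by row keeps every inner-loop access unit-stride while accumulating one sum of squared differences per column; this is an illustrative scalar version, not vDSP_distancesq itself.

```c
#include <stddef.h>

/* Sum of squared differences between corresponding columns of A and B,
 * computed for all columns in a single pass over the rows. The inner
 * loop touches contiguous memory within each row, so no column is ever
 * traversed with a large stride on its own. */
static void distsq_all_columns(const float *a, const float *b,
                               size_t rows, size_t cols, float *out)
{
    for (size_t j = 0; j < cols; j++)
        out[j] = 0.0f;

    for (size_t i = 0; i < rows; i++)        /* row by row */
        for (size_t j = 0; j < cols; j++) {  /* contiguous within a row */
            float d = a[i * cols + j] - b[i * cols + j];
            out[j] += d * d;
        }
}
```

Compared with processing one whole column at a time, this visits each cache line once and sidesteps the power-of-two row-length conflicts described above.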

OTHER TIPS

Still true. AFAIK (as of SSE4.x, and I don't think AVX changes this) SSE memory load instructions load contiguous blocks only.

You can vectorize with a stride of 2, though some additional shuffling operations are required.
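A sketch of that stride-2 trick with SSE intrinsics (x86 only): two contiguous four-wide loads plus one shuffle yield the even-indexed elements. The helper name is made up for illustration.

```c
#include <xmmintrin.h>  /* SSE */

/* Load the four even-indexed (stride-2) floats from 8 contiguous values:
 * two contiguous loads plus one _mm_shuffle_ps. Not as fast as a single
 * unit-stride load, but still vectorized. */
static __m128 load_stride2(const float *p)
{
    __m128 lo = _mm_loadu_ps(p);      /* p[0] p[1] p[2] p[3] */
    __m128 hi = _mm_loadu_ps(p + 4);  /* p[4] p[5] p[6] p[7] */
    /* take elements 0 and 2 of each register: p[0] p[2] p[4] p[6] */
    return _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0));
}
```

Larger strides need more shuffle work per useful element, which is why the payoff shrinks quickly beyond small strides.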

It's really a matter of the data fitting in the same cache line, so that the entire SSE register can be loaded from cache at once. (And the number of memory-to-cache transfers is even more critical to performance.)

In order to support scatter-gather SSE, it's not the SIMD instructions that would need a big update, but the cache and memory controllers.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow