Question

In a helpful but somewhat dated November 2006 article on vectorizing code with vDSP, the author makes the statement:

Important to keep in mind is the fact that only operations with strides equal to one will deliver blazingly fast vectorized code.

Is this still true today? Even on the newer Intel processors with their more capable vector intrinsics?

I ask because I am in the process of writing some matrix math routines, and have just started down the path of switching them all to use Fortran-like column-major ordering in an effort to be more readily compatible with MATLAB, BLAS and LAPACK. But now I find some of my calls to vDSP need to work on vectors that are no longer contiguous…

At present these vDSP calls are the bottleneck routines that my code exercises. Not to say that this will always be the case, but for now at least I would hate to slow them down just to make calls to those other libraries simpler.

My most-frequently-called vDSP routine right now is vDSP_distancesq, in case that makes a difference.


Solution 2

Yes.

In cases where there is good reason, we (Apple’s Vector and Numerics Group) can add optimizations for certain other strides. For example, for a stride of two, on some processor models, we would load vector blocks as normal but then use various permute instructions to extract just every other element. This would result in code that is not as fast as unit-stride code but faster than the current code. This is rarely done, because other approaches are often better, such as copying strided data to a unit-stride buffer, performing several vDSP operations on the buffer, and copying the data back.
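The copy-to-buffer approach can be sketched as below. This is a minimal illustration in plain C: the scalar gather/scatter loops and the squaring step are stand-ins for whatever copy and compute routines you would actually call (vDSP itself offers strided copies and the real vectorized kernels), not the library's implementation.

```c
#include <stddef.h>

/* Gather a strided vector into a contiguous scratch buffer, run one or
 * more unit-stride operations on it, then scatter the results back.
 * Paying the two copies once can be worthwhile when several unit-stride
 * operations are performed on the buffer in between. */
static void process_strided(float *data, size_t n, size_t stride,
                            float *scratch /* capacity >= n */)
{
    for (size_t i = 0; i < n; i++)   /* gather: strided -> unit stride */
        scratch[i] = data[i * stride];

    for (size_t i = 0; i < n; i++)   /* unit-stride work (here: squaring) */
        scratch[i] = scratch[i] * scratch[i];

    for (size_t i = 0; i < n; i++)   /* scatter back to the strided layout */
        data[i * stride] = scratch[i];
}
```

The amortization argument is the key design point: one gather and one scatter bracket as many unit-stride operations as you can batch together.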

The case you describe does not seem like a good candidate for specializing for non-unit strides. If you are performing multiple vDSP_distancesq operations down consecutive columns of an array, it would be better to do them in parallel (multiple columns at once) instead of serial (all of one column, then all of another column,…). If you are doing only single vDSP_distancesq operations down isolated columns, there are other issues. Column operations on matrices are notorious for cache problems, especially if the number of bytes per row is a multiple of a sizable power of two. The operation might be bound by memory loads, so writing specialized code to optimize the calculations might have no gain.
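The "multiple columns at once" idea can be sketched as follows, for the case where columns are the strided direction (row-major storage). Walking row by row keeps every inner-loop access unit-stride while accumulating one sum of squared differences per column; this is an illustrative scalar version, not vDSP_distancesq itself.

```c
#include <stddef.h>

/* Sum of squared differences between corresponding columns of A and B,
 * computed for all columns in a single pass over the rows. The inner
 * loop touches contiguous memory within each row, so no column is ever
 * traversed with a large stride on its own. */
static void distsq_all_columns(const float *a, const float *b,
                               size_t rows, size_t cols, float *out)
{
    for (size_t j = 0; j < cols; j++)
        out[j] = 0.0f;

    for (size_t i = 0; i < rows; i++)        /* row by row */
        for (size_t j = 0; j < cols; j++) {  /* contiguous within a row */
            float d = a[i * cols + j] - b[i * cols + j];
            out[j] += d * d;
        }
}
```

Compared with processing one whole column at a time, this visits each cache line once and sidesteps the power-of-two row-length conflicts described above.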

OTHER TIPS

Still true. AFAIK (as of SSE4.x, and I don't think AVX changes this) SSE memory load instructions load contiguous blocks only.

You can vectorize with a stride of 2, though some additional shuffling operations are required.
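A sketch of that stride-2 trick with SSE intrinsics (x86 only): two contiguous four-wide loads plus one shuffle yield the even-indexed elements. The helper name is made up for illustration.

```c
#include <xmmintrin.h>  /* SSE */

/* Load the four even-indexed (stride-2) floats from 8 contiguous values:
 * two contiguous loads plus one _mm_shuffle_ps. Not as fast as a single
 * unit-stride load, but still vectorized. */
static __m128 load_stride2(const float *p)
{
    __m128 lo = _mm_loadu_ps(p);      /* p[0] p[1] p[2] p[3] */
    __m128 hi = _mm_loadu_ps(p + 4);  /* p[4] p[5] p[6] p[7] */
    /* take elements 0 and 2 of each register: p[0] p[2] p[4] p[6] */
    return _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0));
}
```

Larger strides need more shuffle work per useful element, which is why the payoff shrinks quickly beyond small strides.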

It's really a matter of the data fitting in the same cache line, so that the entire SSE register can be loaded from cache at once. (And the number of memory-to-cache transfers is even more critical to performance.)

In order to support scatter-gather SSE, it's not the SIMD instructions that would need a big update, but the cache and memory controllers.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow