Vectorizing Modular Arithmetic

Question

Neither SSE2 nor AVX2 have integer division instructions. Intel is disingenuous to call the SVML functions intrinsics since many of them are complicated functions which map to several instructions and not just a few.

There is a way to do faster division (and modulo) with SSE2 or AVX2. See this paper Improved division by invariant integers. Basically you precompute a divisor and then do multiplication. Precomputing the divisor takes time but for some value of dim in your code it should win out. I described this method in more detail here SSE integer division? I also successfully implemented this method in a prime number finder Finding lists of prime numbers with SIMD - SSE/AVX

Agner Fog implements 32-bit (but not 64-bit) division in his Vector Class using the method described in that paper. That would be a good place to start if you want some code but you will have to extend it to 64-bit.

Edit: Based on Mysticial's comments and assuming that the inputs are already reduced I produced a version for SSE. If this is compiled in MSVC then it needs to be in 64 bit mode as 32 bit mode does not support _mm_set1_epi64x. This can be fixed for 32 bit mode mode but I don't want to do it.

#ifdef _MSC_VER 
#include <intrin.h>
#endif
#include <nmmintrin.h>                 // SSE4.2
#include <stdint.h>
#include <stdio.h>

void addRq_SSE(int64_t* a, const int64_t* b, const int32_t dim, const int64_t q) {
    __m128i q2 = _mm_set1_epi64x(q);
    __m128i t2 = _mm_sub_epi64(q2,_mm_set1_epi64x(1));
    for(int i = 0; i < dim; i+=2) {
        __m128i a2 = _mm_loadu_si128((__m128i*)&a[i]);
        __m128i b2 = _mm_loadu_si128((__m128i*)&b[i]);
        __m128i c2 = _mm_add_epi64(a2,b2);
        __m128i cmp = _mm_cmpgt_epi64(c2, t2);
        c2 = _mm_sub_epi64(c2, _mm_and_si128(q2,cmp));
        _mm_storeu_si128((__m128i*)&a[i], c2);
    }
}

int main() {
    const int64_t dim = 20;
    int64_t a[dim];
    int64_t b[dim];
    int64_t q = 10;

    for(int i=0; i<dim; i++) {
        a[i] = i%q; b[i] = i%q;
    }
    addRq_SSE(a, b, dim, q);
    for(int i=0; i<dim; i++) {
        printf("%d\n", a[i]);
    }   
}