Question

I'm using MSVC12 (Visual Studio 2013 Express) and I'm trying to implement a fast multiplication of 8*8 float values. The problem is the alignment: the vector actually holds 9*n values, but I only ever need the first 8 of each set of 9. For set 0 the 32-byte alignment is guaranteed (I allocate with _mm_malloc), but set 1 already starts at byte offset 4*9 = 36, which is not a multiple of 32.

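// allocate once, outside the loop, so the buffer is not leaked on every iteration
float *coeff_set = (float *)_mm_malloc(909 * 100 *sizeof(float), 32);
for(unsigned i = 0; i < n; i++) {
    // the aligned load works for i=0, but not for i=1, i=2, ...
    __m256 coefficients = _mm256_load_ps(&coeff_set[9 * i]);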
    __m256 result = _mm256_mul_ps(coefficients, coefficients);

    ...
}

Is there any way to solve this? I would like to keep the structure of my data, but if that is not possible, I would change it. One solution I found was to first copy the 8 floats into an aligned array and then load from it, but the performance loss is far too high.


Solution

You have two choices:

  1. Pad each set of coefficients to 16 values to maintain alignment
  2. Use the _mm256_loadu_ps intrinsic for unaligned accesses

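The first choice is generally faster, since every load is aligned, but it spends 16 floats on each 9-float set. The second keeps your data layout unchanged and wastes no memory, though unaligned loads that straddle a cache-line boundary can be slower.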
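As a rough sketch of both choices (not a tuned implementation): the helper names square_packed and square_padded, the AVX_FN macro and the constants are made up for illustration and are not part of the question's code. Both helpers write the 8 squared values to a plain float array so they can be compared directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* GCC and Clang need AVX enabled per function (or -mavx on the command line);
   MSVC accepts the intrinsics as-is. AVX_FN is a hypothetical helper macro. */
#if defined(__GNUC__) && !defined(__AVX__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

enum { SET_SIZE = 9, PADDED_SIZE = 16 };

/* Choice 2: keep the packed stride of 9 floats and use an unaligned load.
   Reads floats 9*i .. 9*i+7, so it stays inside set i. */
AVX_FN void square_packed(const float *coeff_set, size_t i, float out[8])
{
    __m256 c = _mm256_loadu_ps(&coeff_set[SET_SIZE * i]);
    _mm256_storeu_ps(out, _mm256_mul_ps(c, c));
}

/* Choice 1: each set occupies 16 floats (64 bytes), so if the buffer itself
   is 32-byte aligned, every set start is too and the aligned load is valid. */
AVX_FN void square_padded(const float *padded_set, size_t i, float out[8])
{
    const float *p = &padded_set[PADDED_SIZE * i];
    assert(((uintptr_t)p % 32) == 0); /* would fail with a stride of 9 */
    __m256 c = _mm256_load_ps(p);
    _mm256_storeu_ps(out, _mm256_mul_ps(c, c));
}
```

In the loop from the question, square_packed(coeff_set, i, out) would replace the _mm256_load_ps call without touching the data layout. square_padded assumes the buffer was allocated with _mm_malloc(16 * n * sizeof(float), 32) and that set i was written at index 16 * i.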

Licensed under: CC-BY-SA with attribution