Question

I'm trying to understand the impact of strict aliasing on performance in C99. My goal is to optimize a vector dot product, which takes up a large amount of time in my program (profiled it!). I thought that aliasing could be the problem, but the following code doesn't show any substantial difference between the standard approach and the strict aliasing version, even with vectors of size 100 million. I've also tried to use local variables to avoid aliasing, with similar results.

What's happening?

I'm using gcc-4.7 on OSX 10.7.4. Results are in microseconds.

$ /usr/local/bin/gcc-4.7 -fstrict-aliasing -Wall -std=c99 -O3 -o restrict restrict.c
$ ./restrict
sum:    100000000   69542
sum2:   100000000   70432
sum3:   100000000   70372
sum4:   100000000   69891
$ /usr/local/bin/gcc-4.7 -Wall -std=c99 -O0 -fno-strict-aliasing -o restrict restrict.c
$ ./restrict
sum:    100000000   258487
sum2:   100000000   261349
sum3:   100000000   258829
sum4:   100000000   258129

restrict.c (note this code will need several hundred MB RAM):

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>

/* original */
long sum(int *x, int *y, int n)
{
   long i, s = 0;

   for(i = 0 ; i < n ; i++)
      s += x[i] * y[i];

   return s;
}

/* restrict */
long sum2(int *restrict x, int *restrict y, int n)
{
   long i, s = 0;

   for(i = 0 ; i < n ; i++)
      s += x[i] * y[i];

   return s;
}

/* local restrict */
long sum3(int *x, int *y, int n)
{
   int *restrict xr = x;
   int *restrict yr = y;
   long i, s = 0;

   for(i = 0 ; i < n ; i++)
      s += xr[i] * yr[i];

   return s;
}

/* use local variables */
long sum4(int *x, int *y, int n)
{
   int xr, yr;
   long i, s = 0;

   for(i = 0 ; i < n ; i++)
   {
      xr = x[i];
      yr = y[i];
      s += xr * yr;
   }

   return s;
}

int main(void)
{
   struct timeval tp1, tp2;
   struct timezone tzp;

   long i, n = 1e8L, s;
   int *x = malloc(sizeof(int) * n);
   int *y = malloc(sizeof(int) * n);
   long elapsed1;

   for(i = 0 ; i < n ; i++)
      x[i] = y[i] = 1;

   gettimeofday(&tp1, &tzp);
   s = sum(x, y, n);
   gettimeofday(&tp2, &tzp);
   elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
      + (tp2.tv_usec - tp1.tv_usec);
   printf("sum:\t%ld\t%ld\n", s, elapsed1);

   gettimeofday(&tp1, &tzp);
   s = sum2(x, y, n);
   gettimeofday(&tp2, &tzp);
   elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
      + (tp2.tv_usec - tp1.tv_usec);
   printf("sum2:\t%ld\t%ld\n", s, elapsed1);

   gettimeofday(&tp1, &tzp);
   s = sum3(x, y, n);
   gettimeofday(&tp2, &tzp);
   elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
      + (tp2.tv_usec - tp1.tv_usec);
   printf("sum3:\t%ld\t%ld\n", s, elapsed1);

   gettimeofday(&tp1, &tzp);
   s = sum3(x, y, n);
   gettimeofday(&tp2, &tzp);
   elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
      + (tp2.tv_usec - tp1.tv_usec);
   printf("sum4:\t%ld\t%ld\n", s, elapsed1);

   return EXIT_SUCCESS;
}
Was it helpful?

Solution

Off the cuff:

  • with no strict aliasing rules, the compiler might simply generate optimized code that does subtly different things than intended.

  • It is not a given that disabling strict aliasing rules leads to faster code.

  • If it does, it's also not a given that the optimized code actually show different results. This depends a lot on the actual data access patterns, and often even the processor/cache architecture.

Regarding your example code, I'd say that aliasing is irrelevant (for emitted code, at least) since there is never any write access to the array elements inside the sumXXX functions.

(You might get slightly better performance (or opposite) if you pass the same vector twice. There might be a boon from hot cache and smaller cache footprint. There may be a penalty from redundant Loads putting the prefetch predictor off-track. As always: use a profiler)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top