Question

I am trying to find material that clearly explains the different ways to write C/C++ source code that the Intel compiler can vectorize using array notation and elemental functions. All the material online uses trivial examples: saxpy, reduction, etc. But there is little explanation of how to vectorize code that has conditional branching or contains a loop with loop-carried dependences.

For example, say there is sequential code that I want to run on different arrays. A matrix is stored in row-major format. The sum of each column of the matrix is computed by the compute_seq() function:

#define N      256
#define STRIDE 256

__attribute__((vector))    
inline void compute_seq(float *sum, float* a) {
  int i;
  *sum = 0.0f;
  for(i=0; i<N; i++) 
    *sum += a[i*STRIDE];
}

int main() {
  // Initialize
  float *A = malloc(N*N*sizeof(float));
  float sums[N];
  // The following line is not valid, but I would like to do something like this:
  compute_seq(sums[:],*(A[0:N:1]));
}

Any comments appreciated.


Solution

Here is a corrected version of the example.

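#include <stdlib.h>

#define N      256
#define STRIDE 256

// Elemental function: the compiler may generate a vector clone that
// handles several (sum, a) pairs at once.
__attribute__((vector(linear(sum),linear(a))))
inline void compute_seq(float *sum, float* a) {
  int i;
  *sum = 0.0f;
  // Walk through a with a fixed stride of STRIDE elements.
  for(i=0; i<N; i++)
    *sum += a[i*STRIDE];
}

int main() {
  // Initialize
  float *A = malloc(N*N*sizeof(float));
  float sums[N];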
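  compute_seq(&sums[:],&A[0:N:1]);
}

The important change is at the call site. The expression &sums[:] creates an array section consisting of &sums[0], &sums[1], &sums[2], ... &sums[N-1]. The expression &A[0:N:1] creates an array section consisting of &A[0], &A[1], &A[2], ... &A[N-1], that is, the address of the first element of each column. Because compute_seq already steps through a with stride STRIDE, starting column j at &A[j] keeps every access inside the N*N allocation.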
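For readers without a compiler that supports array notation, the effect of calling an elemental function on two sections of pointers can be spelled out as an ordinary loop in portable C. This is only a sketch of the semantics, not of what the compiler generates. It assumes the column sums the question asks for, so column j starts at &A[j]; the helper name column_sums and the small N are made up for illustration.

```c
#include <stddef.h>

#define N      4   /* small size for illustration; the question uses 256 */
#define STRIDE N   /* row length of the row-major matrix */

/* Scalar body, same logic as compute_seq in the answer, without the
   vector attribute so that any C compiler accepts it. */
static void compute_seq(float *sum, const float *a) {
    *sum = 0.0f;
    for (int i = 0; i < N; i++)
        *sum += a[i * STRIDE];
}

/* Hypothetical helper: the elemental call over two pointer sections,
   written as the loop it is equivalent to. Iteration j receives the
   j-th element of each section, here &sums[j] and &A[j]. */
static void column_sums(float *sums, const float *A) {
    for (int j = 0; j < N; j++)
        compute_seq(&sums[j], &A[j]);
}
```

Each iteration is independent of the others, which is what lets a vectorizing compiler run several of them in the lanes of one vector register.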

I added two linear clauses to the vector attribute to tell the compiler to generate a clone optimized for the case that the arguments are arithmetic sequences, as they are in this example. For this example, they (and the vector attribute) are redundant since the compiler can see both the callee and call site in the same translation unit and figure out the particulars for itself. But if compute_seq were defined in another translation unit, the attribute might help.
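The same idea can be written with the OpenMP SIMD directives, which are a different and more widely supported mechanism than the vector attribute used above. The following is a rough sketch under that substitution, not the code from this answer: the declare simd pragma with linear clauses plays the role of the attribute, and an omp simd loop plays the role of the array sections. The function name column_sums_simd and the small N are made up for illustration.

```c
#define N      4   /* small size for illustration; the question uses 256 */
#define STRIDE N   /* row length of the row-major matrix */

/* OpenMP counterpart of the vector attribute with two linear clauses.
   linear(p) with the default step of 1 tells the compiler that the
   pointer advances by one element from one SIMD lane to the next. */
#pragma omp declare simd linear(sum) linear(a)
void compute_seq(float *sum, const float *a) {
    *sum = 0.0f;
    for (int i = 0; i < N; i++)
        *sum += a[i * STRIDE];
}

/* Hypothetical caller: &sums[j] and &A[j] both advance by one element
   per iteration, which matches the linear clauses above. */
void column_sums_simd(float *sums, const float *A) {
    #pragma omp simd
    for (int j = 0; j < N; j++)
        compute_seq(&sums[j], &A[j]);
}
```

If compute_seq lives in another translation unit, the pragma would go on its declaration in a shared header so that callers know a vector variant exists. A compiler that does not have OpenMP SIMD enabled ignores the pragmas and runs the code as plain scalar C, so the result is the same either way. With GCC or Clang the pragmas are normally enabled with -fopenmp-simd.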

Array notation is a work in progress. icc 14.0 beta compiled my example for Intel(R) Xeon Phi(TM) without complaint. icc 13.0 update 3 reported that it couldn't vectorize the function ("dereference too complex"). Perversely, leaving the vector attribute off shut up the report, probably because the compiler can vectorize it after inlining.

I use the compiler option "-opt-assume-safe-padding" when compiling for Intel(R) Xeon Phi(TM). It may improve vector code quality. It lets the compiler assume that the page beyond any accessed address is safe to touch, thus enabling certain instruction sequences that would otherwise be disallowed.

Licensed under: CC-BY-SA with attribution