Question

I am trying to understand vectorization, but to my surprise this very simple code is not being vectorized:

#define n 1024
int main () {
  int i, a[n], b[n], c[n];

  for(i=0; i<n; i++) { a[i] = i; b[i] = i*i; }
  for(i=0; i<n; i++) c[i] = a[i]+b[i];
}

The Intel compiler, for some reason, vectorizes only the initialization loop (line 5):

> icc -vec-report a.c
a.c(5): (col. 3) remark: LOOP WAS VECTORIZED

With GCC, I seem to get no report at all:

> gcc -ftree-vectorize -ftree-vectorizer-verbose=2 a.c

Am I doing something wrong? Shouldn't this be a very simple vectorizable loop? All the same operations, contiguous memory, etc. My CPU supports SSE1/2/3/4.

--- update ---

Following the answer below, this example works for me.

#include <stdio.h>
#define n 1024

int main () {
  int i, a[n], b[n], c[n];

  for(i=0; i<n; i++) { a[i] = i; b[i] = i*i; }
  for(i=0; i<n; i++) c[i] = a[i]+b[i];

  printf("%d\n", c[1023]);  
}

With icc

> icc -vec-report a.c
a.c(7): (col. 3) remark: LOOP WAS VECTORIZED
a.c(8): (col. 3) remark: LOOP WAS VECTORIZED

And gcc

> gcc -ftree-vectorize -fopt-info-vec -O a.c
a.c:8:3: note: loop vectorized
a.c:7:3: note: loop vectorized

Solution

I've slightly modified your source code to be sure that GCC couldn't remove the loops:

#include <stdio.h>
#define n 1024

int main () {
  int i, a[n], b[n], c[n];

  for(i=0; i<n; i++) { a[i] = i; b[i] = i*i; }
  for(i=0; i<n; i++) c[i] = a[i]+b[i];

  printf("%d\n", c[1023]);  
}

GCC (v4.8.2) can vectorize both loops, but only with optimization enabled (an -O flag):

gcc -ftree-vectorize -ftree-vectorizer-verbose=1 -O2 a.c

and I get:

Analyzing loop at a.c:8

Vectorizing loop at a.c:8

a.c:8 note: LOOP VECTORIZED.
Analyzing loop at a.c:7

Vectorizing loop at a.c:7

a.c:7 note: LOOP VECTORIZED.
a.c: note: vectorized 2 loops in function.

Using the -fdump-tree-vect switch GCC will dump more information in the a.c.##t.vect file (it's quite useful to get an idea of what is happening "inside").

Also consider that:

OTHER TIPS

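Most of the time, compiling with -Ofast -march=native is enough for GCC to vectorize a loop, provided the loop is vectorizable and your processor has suitable vector instructions. Note that -Ofast also enables -ffast-math, which relaxes strict floating-point semantics.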

$ gcc compute_simple.c -Ofast -march=native -fopt-info-vec -o compute_simple.bin
compute_simple.c:14:5: note: loop vectorized
compute_simple.c:14:5: note: loop versioned for vectorization because of possible aliasing
compute_simple.c:14:5: note: loop vectorized

To find out whether your processor supports vector instructions, run lscpu and look at the Flags line.

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              12
...
Vendor ID:           GenuineIntel
...
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge  
 mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall   
nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl   
xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64   
monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1   
sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand   
lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb   
stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1   
hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt   
xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify   
hwp_act_window hwp_epp md_clear flush_l1d

You need SSE/AVX flags on x86 processors (both Intel and AMD) or NEON on ARM; some AMD processors also offer additional extensions such as XOP.

You can find much more information on vectorization in the GCC documentation.

Here is a nice article on the subject, with flags that can be used for many platforms: https://gcc.gnu.org/projects/tree-ssa/vectorization.html

Finally, as written above, use -ftree-vectorizer-verbose=n in old versions of GCC, and -fopt-info-vec / -fopt-info-vec-missed in recent ones, to see which loops were vectorized and which were not.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow