Why vectorizing the loop does not have performance improvement

Question 1

This original answer was valid back in 2013. As of 2017 hardware, things have changed enough that both the question and the answer are out-of-date.

See the end of this answer for the 2017 update.

Original Answer (2013):

Because you're bottlenecked by memory bandwidth.

While vectorization and other micro-optimizations can improve the speed of computation, they can't increase the speed of your memory.

In your example:

for(k = 0; k < LEN; k++)
    c[k] = a[k] * b[k];

You are making a single pass over all the memory doing very little work. This is maxing out your memory bandwidth.

So regardless of how it's optimized (vectorized, unrolled, etc.), it isn't gonna get much faster.

A typical desktop machine of 2013 has on the order of 10 GB/s of memory bandwidth*.
Your loop touches 24 bytes/iteration.

Without vectorization, a modern x64 processor can probably do about 1 iteration a cycle*.

Suppose you're running at 4 GHz:

(4 * 10^9) * 24 bytes/iteration = 96 GB/s

That's almost 10x of your memory bandwidth - without vectorization.
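The back-of-the-envelope arithmetic above can be written out as a small sketch. The helper name `demanded_gb_per_s` and the constants are made up for illustration; the inputs are just the figures assumed in this answer (4 GHz, about 1 iteration per cycle, 24 bytes per iteration, roughly 10 GB/s of memory bandwidth), and 8-byte doubles are assumed.

```cpp
// Hypothetical helper (not from the original answer): the bandwidth in GB/s
// that a loop demands if it sustains `iters_per_cycle` iterations per cycle
// at `ghz` GHz while touching `bytes_per_iter` bytes per iteration.
constexpr double demanded_gb_per_s(double ghz, double iters_per_cycle,
                                   double bytes_per_iter){
    return ghz * iters_per_cycle * bytes_per_iter;
}

//  2 loads + 1 store of 8-byte doubles = 24 bytes/iteration.
constexpr double kBytesPerIter = 3 * sizeof(double);

//  Figures assumed in the text: 4 GHz, 1 iteration/cycle, ~10 GB/s available.
constexpr double kDemand    = demanded_gb_per_s(4.0, 1.0, kBytesPerIter);
constexpr double kAvailable = 10.0;
constexpr double kRatio     = kDemand / kAvailable;
```

With these inputs `kDemand` is 96 GB/s and `kRatio` is about 9.6, which is the "almost 10x" gap: the core can ask for data far faster than main memory can deliver it.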

*Not surprisingly, a few people doubted the numbers I gave above since I gave no citation. Well, those were off the top of my head from experience. So here are some benchmarks to prove it.

The loop iteration can run as fast as 1 cycle/iteration:

We can get rid of the memory bottleneck if we reduce LEN so that it fits in cache.
(I tested this in C++ since it was easier. But it makes no difference.)

#include <iostream>
#include <stdlib.h>    //  malloc(), rand()
#include <time.h>
using std::cout;
using std::endl;

int main(){
    const int LEN = 256;

    double *a = (double*)malloc(LEN*sizeof(*a));
    double *b = (double*)malloc(LEN*sizeof(*a));
    double *c = (double*)malloc(LEN*sizeof(*a));

    int k;
    for(k = 0; k < LEN; k++){
        a[k] = rand();
        b[k] = rand();
    }

    clock_t time0 = clock();

    for (int i = 0; i < 100000000; i++){
        for(k = 0; k < LEN; k++)
            c[k] = a[k] * b[k];
    }

    clock_t time1 = clock();
    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
}

Processor: Intel Core i7 2600K @ 4.2 GHz
Compiler: Visual Studio 2012
Time: 6.55 seconds

In this test, I ran 25,600,000,000 iterations in only 6.55 seconds.

6.55 * 4.2 GHz = 27,510,000,000 cycles
27,510,000,000 / 25,600,000,000 = 1.074 cycles/iteration

Now if you're wondering how it's possible to do:

2 loads
1 store
1 multiply
increment counter
compare + branch

all in one cycle...

It's because modern processors and compilers are awesome.

While each of these operations has latency (especially the multiply), the processor is able to execute multiple iterations at the same time. My test machine is a Sandy Bridge processor, which is capable of sustaining 2x128b loads, 1x128b store, and 1x256b vector FP multiply every single cycle. And potentially another one or two vector or integer ops, if the loads are memory source operands for micro-fused uops. (The 2 loads + 1 store throughput is only reached when using 256b AVX loads/stores; otherwise the limit is two total memory ops per cycle, at most one of them a store.)

Looking at the assembly (which I'll omit for brevity), it seems that the compiler unrolled the loop, thereby reducing the looping overhead. But it didn't quite manage to vectorize it.

Memory bandwidth is on the order of 10 GB/s:

The easiest way to test this is via a memset():

#include <iostream>
#include <stdlib.h>    //  calloc()
#include <string.h>    //  memset()
#include <time.h>
using std::cout;
using std::endl;

int main(){
    const int LEN = 1 << 30;    //  1GB

    char *a = (char*)calloc(LEN,1);

    clock_t time0 = clock();

    for (int i = 0; i < 100; i++){
        memset(a,0xff,LEN);
    }

    clock_t time1 = clock();
    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
}

Processor: Intel Core i7 2600K @ 4.2 GHz
Compiler: Visual Studio 2012
Time: 5.811 seconds

So it takes my machine 5.811 seconds to write to 100 GB of memory. That's about 17.2 GB/s.
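For anyone who wants the GB/s figure directly, here is a minimal sketch of the same measurement wrapped in a function. It is not the code I ran above: the name `memset_gb_per_s` is hypothetical, it times with `std::chrono::steady_clock` (wall-clock time) instead of `clock()`, and it assumes `bytes > 0`.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: writes `bytes` bytes `passes` times with memset()
// and returns the observed write bandwidth in GB/s (10^9 bytes per second).
// Assumes bytes > 0.
double memset_gb_per_s(std::size_t bytes, int passes){
    std::vector<char> buf(bytes, 0);    //  zero-fill touches every page before timing

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < passes; i++){
        //  Alternate the fill value so consecutive passes are not identical.
        std::memset(buf.data(), (i & 1) ? 0xff : 0x00, bytes);
    }
    auto t1 = std::chrono::steady_clock::now();

    //  Read one byte back so the buffer contents are actually used.
    volatile char sink = buf[bytes / 2];
    (void)sink;

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return (double)bytes * passes / seconds / 1e9;
}
```

Calling something like `memset_gb_per_s(1 << 30, 100)` would mirror the test above. The buffer has to be much larger than the last-level cache, otherwise the result reflects cache bandwidth rather than main memory.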

And my processor is on the higher end. The Nehalem and Core 2 generation processors have less memory bandwidth.

Update March 2017:

As of 2017, things have gotten more complicated.

Thanks to DDR4 and quad-channel memory, it is no longer possible for a single thread to saturate memory bandwidth. But the problem of bandwidth doesn't necessarily go away. Even though bandwidth has gone up, processor cores have also improved - and there are more of them.

To put it mathematically:

Each core has a bandwidth limit X.
Main memory has a bandwidth limit of Y.
On older systems, X > Y.
On current high-end systems, X < Y. But X * (# of cores) > Y.

Back in 2013: Sandy Bridge @ 4 GHz + dual-channel DDR3 @ 1333 MHz

No vectorization (8-byte load/stores): X = 32 GB/s and Y = ~17 GB/s
Vectorized SSE* (16-byte load/stores): X = 64 GB/s and Y = ~17 GB/s

Now in 2017: Haswell-E @ 4 GHz + quad-channel DDR4 @ 2400 MHz

No vectorization (8-byte load/stores): X = 32 GB/s and Y = ~70 GB/s
Vectorized AVX* (32-byte load/stores): X = 64 GB/s and Y = ~70 GB/s

*For both Sandy Bridge and Haswell, architectural limits in the cache will limit bandwidth to about 16 bytes/cycle regardless of SIMD width.

So nowadays, a single thread will not always be able to saturate memory bandwidth. And you will need to vectorize to achieve that limit of X. But you will still hit the main memory bandwidth limit of Y with 2 or more threads.
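The X/Y relationship can be sketched as a one-line model. This is a simplification under stated assumptions: the function name `achievable_gb_per_s` is hypothetical, per-core bandwidth is assumed to scale linearly with thread count until the memory limit is hit, and real systems will fall somewhat short of it.

```cpp
#include <algorithm>

// Hypothetical model of the X/Y relationship above (all values in GB/s):
// n threads can pull at most n * X in total, capped by main memory at Y.
constexpr double achievable_gb_per_s(int threads, double per_core_x, double memory_y){
    return std::min(threads * per_core_x, memory_y);
}
```

With the 2017 vectorized figures (X = 64, Y = 70), `achievable_gb_per_s(1, 64, 70)` gives 64 and `achievable_gb_per_s(2, 64, 70)` gives 70, so the second thread already hits the memory limit. With the 2013 non-vectorized figures (X = 32, Y = 17), a single thread is capped at 17.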

But one thing hasn't changed and probably won't change for a long time: You will not be able to run a bandwidth-hogging loop on all cores without saturating the total memory bandwidth.

Question 2

As Mysticial already described, main-memory bandwidth limitations are the bottleneck for large buffers here. The way around this is to redesign your processing to work in chunks that fit in the cache. (Instead of multiplying a whole 200 MiB of doubles, multiply just 128 kiB, then do something with that, so the code that uses the output of the multiply will find it still in L2 cache. L2 is typically 256 kiB, and is private to each CPU core, on recent Intel designs.)

This technique is called cache blocking, or loop tiling. It might be tricky for some algorithms, but the payoff is the difference between L2 cache bandwidth vs. main memory bandwidth.
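A minimal sketch of what cache blocking could look like for this multiply. Everything here is illustrative: the function name `blocked_multiply_then_sum` is hypothetical, the "do something with that" step is stood in for by a plain sum, and the 128 kiB block size is simply the figure from above, which would need tuning on real hardware.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical example: computes sum(a[k] * b[k]) in cache-sized blocks.
// The multiply writes into a small scratch buffer, and the consumer (here a
// plain sum) reads that buffer back while it should still be in cache.
double blocked_multiply_then_sum(const double *a, const double *b, std::size_t len){
    //  Assumed block size: 128 kiB worth of doubles.
    const std::size_t BLOCK = 128 * 1024 / sizeof(double);
    std::vector<double> c(BLOCK);
    double total = 0.0;

    for (std::size_t start = 0; start < len; start += BLOCK){
        std::size_t n = std::min(BLOCK, len - start);

        //  Producer: multiply one block.
        for (std::size_t k = 0; k < n; k++)
            c[k] = a[start + k] * b[start + k];

        //  Consumer: use the block right away instead of after a full pass.
        for (std::size_t k = 0; k < n; k++)
            total += c[k];
    }
    return total;
}
```

The point of the structure is that `c` never grows beyond one block, so the output of the multiply is never written out to main memory and read back. For a consumer as trivial as a sum you would just fuse the two loops; the separate loops stand in for a consumer that is a different piece of code.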

If you do this, make sure the compiler isn't still generating streaming stores (movnt...). Those writes bypass the caches to avoid polluting them with data that won't fit. The next read of that data will need to touch main memory.

Question 3

EDIT: Modified the answer a lot. Also, please disregard most of what I wrote before about Mysticial's answer not being entirely correct. Though, I still do not agree that it is bottlenecked by memory: despite doing a very wide variety of tests, I couldn't see any signs of the original code being bound by memory speed. Meanwhile it kept showing clear signs of being CPU-bound.

There can be many reasons. And since the reason[s] can be very hardware-dependent, I decided I shouldn't speculate based on guesses. Just going to outline these things I encountered during later testing, where I used a much more accurate and reliable CPU time measuring method and looping-the-loop 1000 times. I believe this information could be of help. But please take it with a grain of salt, as it's hardware dependent.

When using instructions from the SSE family, vectorized code I got was over 10% faster vs. non-vectorized code.
Vectorized code using SSE-family and vectorized code using AVX ran more or less with the same performance.
When using AVX instructions, non-vectorized code ran the fastest - 25% or more faster than every other thing I tried.
Results scaled linearly with CPU clock in all cases.
Results were hardly affected by memory clock.
Results were considerably affected by memory latency - much more than memory clock, but not nearly as much as CPU clock affected the results.

WRT Mysticial's example of running nearly 1 iteration per clock - I didn't expect the CPU scheduler to be that efficient and was assuming 1 iteration every 1.5-2 clock ticks. But to my surprise, that is not the case; I sure was wrong, sorry about that. My own CPU ran it even more efficiently - 1.048 cycles/iteration. So I can attest that this part of Mysticial's answer is definitely right.

Question 4

Just in case a[], b[] and c[] are fighting for the L2 cache:

#include <string.h> /* for memcpy */

 ...

 gettimeofday(&stTime, NULL);

    for(k = 0; k < LEN; k += 4) {
        double a4[4], b4[4], c4[4];
        memcpy(a4,a+k, sizeof a4);
        memcpy(b4,b+k, sizeof b4);
        c4[0] = a4[0] * b4[0];
        c4[1] = a4[1] * b4[1];
        c4[2] = a4[2] * b4[2];
        c4[3] = a4[3] * b4[3];
        memcpy(c+k,c4, sizeof c4);
    }

    gettimeofday(&endTime, NULL);

This reduces the running time from 98429.000000 to 67213.000000; unrolling the loop 8-fold reduces it further to 57157.000000 here.
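The 8-fold version is not shown above, so the following is only one plausible shape for it, written as a standalone function. The name `multiply_blocks8` is hypothetical, and the scalar tail loop is an addition so that lengths which are not a multiple of 8 are still handled.

```cpp
#include <cstddef>
#include <cstring>   /* for memcpy */

// Hypothetical 8-wide variant of the loop above: copy 8 doubles of each
// input into locals, multiply, then copy the 8 results back out.
void multiply_blocks8(const double *a, const double *b, double *c, std::size_t len){
    std::size_t k = 0;
    for (; k + 8 <= len; k += 8){
        double a8[8], b8[8], c8[8];
        std::memcpy(a8, a + k, sizeof a8);
        std::memcpy(b8, b + k, sizeof b8);
        for (int j = 0; j < 8; j++)     /* fixed trip count of 8 */
            c8[j] = a8[j] * b8[j];
        std::memcpy(c + k, c8, sizeof c8);
    }
    /* Tail for lengths that are not a multiple of 8. */
    for (; k < len; k++)
        c[k] = a[k] * b[k];
}
```

Whether this helps at all depends on the compiler and the hardware, so it is worth timing against the plain loop rather than assuming a gain.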