Question

I'm trying to improve the performance of a copy operation using SSE and AVX:

    #include <immintrin.h>
    #include <algorithm>    // std::generate, std::copy
    #include <chrono>       // timing
    #include <iostream>     // std::cout

    const int sz = 1024;
    float *mas = (float *)_mm_malloc(sz*sizeof(float), 32); // 32-byte alignment so the aligned AVX load/store below is valid
    float *tar = (float *)_mm_malloc(sz*sizeof(float), 32);
    float a=0;
    std::generate(mas, mas+sz, [&](){return ++a;});
    
    const int nn = 1000; // Number of iterations in the tester loops
    std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2, start3, end3; 
    
    //std::copy testing
    start1 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
        std::copy(mas, mas+sz, tar);
    end1 = std::chrono::system_clock::now();
    float elapsed1 = std::chrono::duration_cast<std::chrono::microseconds>(end1-start1).count();
    
    //SSE-copy testing
    start2 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=4, _tar+=4)
        {
           __m128 buffer = _mm_load_ps(_mas);
           _mm_store_ps(_tar, buffer);
        }
    }
    end2 = std::chrono::system_clock::now();
    float elapsed2 = std::chrono::duration_cast<std::chrono::microseconds>(end2-start2).count();
     
    //AVX-copy testing
    start3 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=8, _tar+=8)
        {
           __m256 buffer = _mm256_load_ps(_mas);
           _mm256_store_ps(_tar, buffer);
        }
    }
    end3 = std::chrono::system_clock::now();
    float elapsed3 = std::chrono::duration_cast<std::chrono::microseconds>(end3-start3).count();
    
    std::cout<<"serial - "<<elapsed1<<", SSE - "<<elapsed2<<", AVX - "<<elapsed3<<"\nSSE gain: "<<elapsed1/elapsed2<<"\nAVX gain: "<<elapsed1/elapsed3;
    
    _mm_free(mas);
    _mm_free(tar);

It works. However, as the number of iterations in the tester loops (nn) increases, the performance gain of the SIMD copies decreases:

nn=10: SSE-gain=3, AVX-gain=6;

nn=100: SSE-gain=0.75, AVX-gain=1.5;

nn=1000: SSE-gain=0.55, AVX-gain=1.1;

Can anybody explain the reason for this performance decrease, and is it advisable to vectorize the copy operation manually?


The solution

The problem is that your test does a poor job of mitigating some of the factors in the hardware that make benchmarking hard. To test this, I made my own test case. Something like this:

for blah blah:
    sleep(500ms)
    std::copy
    sse
    avx

output:

SSE: 1.11753x faster than std::copy
AVX: 1.81342x faster than std::copy

So in this case, AVX is quite a bit faster than std::copy. What happens when I change the test case to...

for blah blah:
    sleep(500ms)
    sse
    avx
    std::copy

Notice that absolutely nothing changed, except the order of the tests.

SSE: 0.797673x faster than std::copy
AVX: 0.809399x faster than std::copy

Whoa! How is that possible? The CPU takes a while to ramp up to full speed, so tests that run later have an advantage. This question has 3 answers now, including an 'accepted' answer, but only the one with the lowest number of upvotes was on the right track.

This is one of the reasons why benchmarking is hard and why you should never trust anyone's micro-benchmarks unless they've included detailed information about their setup. It isn't just the code that can go wrong. Power-saving features and weird drivers can completely mess up your benchmark. I once measured a factor-of-7 difference in performance by toggling a switch in the BIOS that fewer than 1% of notebooks offer.
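For reference, here is a minimal sketch of a harness along the lines of the pseudocode above. The buffer size, repetition counts, and steady_clock timing are my assumptions, not the exact code that produced the numbers; swapping the order of the three timed lambdas reproduces the ordering effect described above:

    #include <immintrin.h>
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    // Times `reps` passes of one copy routine and returns the elapsed microseconds.
    template <class F>
    double time_us(F&& copy_once, int reps)
    {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < reps; ++i)
            copy_once();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    }

    int main()
    {
        const int sz = 1024, reps = 1000;
        float *src = (float *)_mm_malloc(sz * sizeof(float), 32);
        float *dst = (float *)_mm_malloc(sz * sizeof(float), 32);
        std::fill(src, src + sz, 1.0f);

        for (int run = 0; run < 5; ++run)
        {
            // Let the clocks wind back down between runs, as in the pseudocode above.
            std::this_thread::sleep_for(std::chrono::milliseconds(500));

            double t_std = time_us([&]{ std::copy(src, src + sz, dst); }, reps);
            double t_sse = time_us([&]{
                for (int i = 0; i < sz; i += 4)
                    _mm_store_ps(dst + i, _mm_load_ps(src + i));
            }, reps);
            double t_avx = time_us([&]{
                for (int i = 0; i < sz; i += 8)
                    _mm256_store_ps(dst + i, _mm256_load_ps(src + i));
            }, reps);

            std::printf("std::copy %.0f us, SSE %.0f us, AVX %.0f us\n", t_std, t_sse, t_avx);
        }

        _mm_free(src);
        _mm_free(dst);
        return 0;
    }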

Other tips

This is a very interesting question, but I believe none of the answers so far is correct, because the question itself is misleading.

The title should be changed to "How does one reach the theoretical memory I/O bandwidth?"

No matter what instruction set is used, the CPU is so much faster than RAM that a pure block memory copy is 100% I/O bound. This explains why there is so little difference between SSE and AVX performance here.

For small buffers that are hot in the L1D cache, AVX can copy significantly faster than SSE on CPUs like Haswell, where 256-bit loads/stores really do use a 256-bit data path to the L1D cache instead of splitting into two 128-bit operations.

Ironically, the ancient x86 string instruction rep movsq performs much better than SSE and AVX at plain memory copies!
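As an illustration only (not from the original answer), here is a minimal sketch of such a copy using inline assembly, assuming GCC or Clang on x86-64 and a byte count that is a multiple of 8:

    #include <cstddef>

    // Copies `bytes` bytes (assumed to be a multiple of 8) with `rep movsq`.
    // RDI = destination, RSI = source, RCX = quadword count; the ABI guarantees DF is clear.
    void rep_movsq_copy(void *dst, const void *src, std::size_t bytes)
    {
        std::size_t qwords = bytes / 8;
        asm volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
    }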

The article here explains how to saturate memory bandwidth really well, and it also has rich references to explore further.

See also Enhanced REP MOVSB for memcpy here on SO, where @BeeOnRope's answer discusses NT stores (and non-RFO stores done by rep stosb/stosq) vs. regular stores, and how single-core memory bandwidth is often limited by max concurrency / latency, not by the memory controller itself.

Writing fast SSE is not as simple as using SSE operations in place of their non-parallel equivalents. In this case I suspect your compiler cannot usefully unroll the load/store pair and your time is dominated by stalls caused by using the output of one low-throughput operation (the load) in the very next instruction (the store).

You can test this idea by manually unrolling one notch:

    //SSE-copy testing
    start2 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=8, _tar+=8)
        {
           __m128 buffer1 = _mm_load_ps(_mas);
           __m128 buffer2 = _mm_load_ps(_mas+4);
           _mm_store_ps(_tar, buffer1);
           _mm_store_ps(_tar+4, buffer2);
        }
    }

Normally when using intrinsics I disassemble the output and make sure nothing crazy is going on (you could try this to verify if/how the original loop got unrolled). For more complex loops, the right tool is the Intel Architecture Code Analyzer (IACA), a static analysis tool that can tell you things like "you have pipeline stalls".
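For reference, a rough sketch of how IACA markers are typically placed around the inner loop under analysis. This assumes the iacaMarks.h header shipped with IACA, and the markers must be removed again before timing real runs, since they inject dummy marker instructions:

    #include "iacaMarks.h"   // ships with IACA; defines IACA_START / IACA_END

    for(; _mas!=mas+sz; _mas+=4, _tar+=4)
    {
        IACA_START                         // begin the region IACA analyzes
        __m128 buffer = _mm_load_ps(_mas);
        _mm_store_ps(_tar, buffer);
    }
    IACA_END                               // end of the analyzed region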

I think this is because the measurement is not accurate for fairly short operations.

When measuring performance on an Intel CPU:

  1. Disable "Turbo Boost" and "SpeedStep". You can do this in the system BIOS.

  2. Change the process/thread priority to High or Realtime. This will keep your thread running.

  3. Set the process CPU affinity mask to a single core. CPU masking combined with higher priority will minimize context switching.

  4. Use the __rdtsc() intrinsic. Intel Core series CPUs return the internal clock counter with __rdtsc(); you will get 3,400,000,000 counts per second from a 3.4 GHz CPU, which gives far finer resolution than the system clock for short measurements.

This is my test-bed startup code for testing SSE/AVX code.

    #include <windows.h>   // GetProcessAffinityMask, SetProcessAffinityMask, SetPriorityClass
    #include <intrin.h>    // __rdtsc
    #include <tchar.h>     // _tmain / _TCHAR
    #include <cstdio>      // printf

    // Returns the 1-based position of the most significant set bit (0 if no bit is set).
    int GetMSB(DWORD_PTR dwordPtr)
    {
        if(dwordPtr)
        {
            int result = 1;
    #if defined(_WIN64)
            if(dwordPtr & 0xFFFFFFFF00000000) { result += 32; dwordPtr &= 0xFFFFFFFF00000000; }
            if(dwordPtr & 0xFFFF0000FFFF0000) { result += 16; dwordPtr &= 0xFFFF0000FFFF0000; }
            if(dwordPtr & 0xFF00FF00FF00FF00) { result += 8;  dwordPtr &= 0xFF00FF00FF00FF00; }
            if(dwordPtr & 0xF0F0F0F0F0F0F0F0) { result += 4;  dwordPtr &= 0xF0F0F0F0F0F0F0F0; }
            if(dwordPtr & 0xCCCCCCCCCCCCCCCC) { result += 2;  dwordPtr &= 0xCCCCCCCCCCCCCCCC; }
            if(dwordPtr & 0xAAAAAAAAAAAAAAAA) { result += 1; }
    #else
            if(dwordPtr & 0xFFFF0000) { result += 16; dwordPtr &= 0xFFFF0000; }
            if(dwordPtr & 0xFF00FF00) { result += 8;  dwordPtr &= 0xFF00FF00; }
            if(dwordPtr & 0xF0F0F0F0) { result += 4;  dwordPtr &= 0xF0F0F0F0; }
            if(dwordPtr & 0xCCCCCCCC) { result += 2;  dwordPtr &= 0xCCCCCCCC; }
            if(dwordPtr & 0xAAAAAAAA) { result += 1; }
    #endif
            return result;
        }
        else
        {
            return 0;
        }
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
        // Set Core Affinity
        DWORD_PTR processMask, systemMask;
        GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask);
        SetProcessAffinityMask(GetCurrentProcess(), (DWORD_PTR)1 << (GetMSB(processMask) - 1)); // cast so the shift is done in 64 bits on x64
    
        // Set Process Priority. you can use REALTIME_PRIORITY_CLASS.
        SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    
        DWORD64 start, end;
        start = __rdtsc();
    // your code here.
        end = __rdtsc();
        printf("%I64d\n", end - start);
        return 0;
    }

I think that your main problem/bottleneck is your _mm_malloc.

I highly suggest using std::vector as your main data structure if you are concerned about locality in C++.

Intrinsics are not exactly a "library"; they are more like built-in functions provided by your compiler, so you should be familiar with your compiler's internals/docs before using these functions.

Also note that the fact that AVX is newer than SSE doesn't make AVX faster. Whatever you are planning to use, the number of cycles taken by a function is probably more important than the "AVX vs. SSE" argument; for example, see this answer.

Try with a POD int array[] or an std::vector.
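For example, here is a minimal sketch of the std::vector variant (the variable names mirror the question; the compiler is left to vectorize std::copy on its own):

    #include <algorithm>
    #include <vector>

    const int sz = 1024;
    std::vector<float> mas(sz), tar(sz);   // contiguous storage, automatically freed
    float a = 0;
    std::generate(mas.begin(), mas.end(), [&]{ return ++a; });
    std::copy(mas.begin(), mas.end(), tar.begin());   // typically lowered to a vectorized memmove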

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow