Function crashes when using _mm_load_pd

Question 1

Your data is not guaranteed to be 16 byte aligned as required by SSE loads. Either use _mm_loadu_pd:

    a = _mm_loadu_pd(A);
    ...
    a = _mm_loadu_pd(A2ptr);
    b = _mm_loadu_pd(B2ptr);

or make sure that your data is correctly aligned where possible, e.g. for static or locals:

alignas(16) double A2[2], B2[2], C[2];    // C++11, or C11 with <stdalign.h>

or without C++11, using compiler-specific language extensions:

 __attribute__ ((aligned(16))) double A2[2], B2[2], C[2];   // gcc/clang/ICC/et al

__declspec (align(16))         double A2[2], B2[2], C[2];   // MSVC

You could use #ifdef to #define an ALIGN(x) macro that works on the target compiler.

Question 2

Let me try and answer why your code works in Linux and not Windows. Code compiled in 64-bit mode has the stack aligned by 16 bytes. However, code compiled in 32-bit mode is only 4 byte aligned on windows and is not guaranteed to be 16 byte aligned on Linux.

GCC defaults to 64-bit mode on 64-bit systems. However MSVC defaults to 32-bit mode even on 64-bit systems. So I'm going to guess that you did not compile your code in 64-bit mode in windows and _mm_load_pd and _mm_store_pd both need 16 byte aligned addresses so the code crashes.

You have at least three different solutions to get your code working in Windows as well.

Compile your code in 64 bit mode.
Use unaligned loads and stores (e.g. _mm_storeu_pd)
Align the data yourself as Paul R suggested.

The best solution is the third solution since then your code will work on 32 bit systems and on older systems where unaligned loads/stores are much slower.

Question 3

If you look at http://msdn.microsoft.com/en-us/library/cww3b12t(v=vs.90).aspx you can see that the function __mm_load_pd is defined as:

__m128d _mm_load_pd (double *p);

So, in your code A should be of type double, but A is of tipe T that is a template param. You should be sure that you are calling your SSE_vectormult function with the rights template params or just remove the template and use the double type instead,