python - using ctypes and SSE/AVX SOMETIMES segfaults

Question 1

Alright, I think I found a solution. It's not very elegant, but at least it works! There should be a better way; does anyone have suggestions?

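#include <immintrin.h>

extern "C" {
int foobar(float *stock, float *max_vola, int path_cnt)
{
    if (path_cnt <= 0)
        return 0;
    // scratch buffers aligned to 32 bytes, as _mm256_load_ps requires
    float *test1 = (float*)_mm_malloc(path_cnt * sizeof(float), 32);
    float *test2 = (float*)_mm_malloc(path_cnt * sizeof(float), 32);
    if (!test1 || !test2)
    {
        if (test1) _mm_free(test1);
        if (test2) _mm_free(test2);
        return -1;
    }
    // copy to aligned memory (this part is kinda stupid)
    for (int i = 0; i < path_cnt; i++)
    {
        test1[i] = stock[i];
        test2[i] = max_vola[i];
    }
    // add 8 floats at a time; stop before a partial block
    // so we never read or write past the end of the buffers
    int i = 0;
    for (; i + 8 <= path_cnt; i += 8)
    {
        __m256 arr1 = _mm256_load_ps(&test1[i]);
        __m256 arr2 = _mm256_load_ps(&test2[i]);
        __m256 add  = _mm256_add_ps(arr1, arr2);
        _mm256_store_ps(&test1[i], add);
    }
    // leftover elements when path_cnt is not a multiple of 8
    for (; i < path_cnt; i++)
        test1[i] += test2[i];
    // and copy everything back!
    for (int j = 0; j < path_cnt; j++)
        stock[j] = test1[j];
    // release the scratch buffers (the original version leaked them)
    _mm_free(test1);
    _mm_free(test2);
    return 0;
}
}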

Question 2

There are aligned and unaligned load instructions. The aligned ones fault if the address violates the alignment rule (32 bytes for the 256-bit AVX loads), but they can be faster. The unaligned ones accept any address and handle the misalignment internally. You are using the aligned version, _mm256_load_ps, and can simply switch to the unaligned version, _mm256_loadu_ps (with _mm256_storeu_ps for the store), without any intermediate allocation.
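As a rough sketch of that change, the function from the question could look something like the following. The name foobar_unaligned and the AVX_TARGET helper macro are made up for this example. The macro only lets GCC/Clang compile the intrinsics without -mavx on the command line, and is not needed if you already build with that flag. The scalar loop at the end is an addition that covers a path_cnt that is not a multiple of 8.

```cpp
#include <immintrin.h>

// Hypothetical helper macro: enables AVX for one function on GCC/Clang
// so the file builds even without -mavx. Other compilers get nothing.
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

extern "C" {
// Adds max_vola to stock in place. Neither pointer needs any
// particular alignment because only the unaligned forms are used.
AVX_TARGET int foobar_unaligned(float *stock, const float *max_vola, int path_cnt)
{
    int i = 0;
    // full blocks of 8 floats
    for (; i + 8 <= path_cnt; i += 8)
    {
        __m256 arr1 = _mm256_loadu_ps(&stock[i]);
        __m256 arr2 = _mm256_loadu_ps(&max_vola[i]);
        _mm256_storeu_ps(&stock[i], _mm256_add_ps(arr1, arr2));
    }
    // scalar clean-up for the remaining 0..7 elements
    for (; i < path_cnt; i++)
        stock[i] += max_vola[i];
    return 0;
}
}
```

Since nothing is copied, the pointers that ctypes hands over from the NumPy arrays can be used directly, whatever their alignment happens to be.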

A good vectorizing compiler will include a lead-in loop to reach an aligned address, then a body to work on aligned data, then a final loop to clean up any stragglers.
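Written out by hand, that three-part structure might look roughly like the sketch below. This only illustrates the idea and is not what any particular compiler emits. The name foobar_peeled and the AVX_TARGET macro are hypothetical. The lead-in loop can only align one pointer, so this version aligns stock and still uses the unaligned load for max_vola.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical helper macro: enables AVX for one function on GCC/Clang
// so the file builds even without -mavx. Other compilers get nothing.
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

extern "C" {
AVX_TARGET int foobar_peeled(float *stock, const float *max_vola, int path_cnt)
{
    int i = 0;
    // lead-in: scalar steps until &stock[i] sits on a 32-byte boundary
    while (i < path_cnt &&
           reinterpret_cast<std::uintptr_t>(stock + i) % 32 != 0)
    {
        stock[i] += max_vola[i];
        i++;
    }
    // body: stock is aligned here, so the aligned load/store are safe;
    // max_vola may sit at a different offset, so it uses the unaligned load
    for (; i + 8 <= path_cnt; i += 8)
    {
        __m256 arr1 = _mm256_load_ps(stock + i);
        __m256 arr2 = _mm256_loadu_ps(max_vola + i);
        _mm256_store_ps(stock + i, _mm256_add_ps(arr1, arr2));
    }
    // tail: stragglers that do not fill a whole vector
    for (; i < path_cnt; i++)
        stock[i] += max_vola[i];
    return 0;
}
}
```

Whether this beats simply using the unaligned instructions everywhere depends on the CPU, so it is worth measuring before adding the extra loops.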