Your arrays _A and _B are most likely too large to fit on the stack. A quick-n-dirty fix is to move the arrays out to global scope. A better fix is to allocate them dynamically on the heap using new[] and delete[] as follows:
Complex *_A = new Complex[m*k];
Complex *_B = new Complex[k*n];
...
delete [] _A;
delete [] _B;
An even better option, since you're using C++, is to use a std::vector:
std::vector < Complex > _A(m*k);
std::vector < Complex > _B(k*n);
// But now to get the pointer you need this:
cudaMemcpy( A, &_A[0], (m*k)*sizeof(Complex), cudaMemcpyHostToDevice );
// etc.
That &_A[0] syntax means: take the address of the first element of the vector. Because a vector stores its elements contiguously, this is a pointer to the start of the whole buffer and can be passed wherever a plain array pointer is expected. (In C++11 and later, _A.data() gives the same pointer.) The reason to prefer a vector over manually allocating the memory is that destruction/deallocation happens automatically when the variable goes out of scope, which is essential for writing exception-safe code.
You'll also need #include <vector> at the top of the file.
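To make the vector approach concrete without needing a GPU, here is a minimal host-only sketch. The Complex struct, copy_raw and round_trip are hypothetical names made up for illustration; copy_raw merely stands in for a C-style call such as cudaMemcpy that only sees a raw pointer and an element count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumption: a plain two-float struct standing in for the question's Complex type.
struct Complex {
    float re;
    float im;
};

// Hypothetical stand-in for a C-style API such as cudaMemcpy:
// it only receives raw pointers and an element count.
void copy_raw(Complex* dst, const Complex* src, std::size_t count) {
    std::memcpy(dst, src, count * sizeof(Complex));
}

// Allocates two m*k buffers on the heap, copies one into the other through
// the raw-pointer interface, and returns the last element of the copy.
Complex round_trip(std::size_t m, std::size_t k) {
    assert(m * k > 0);  // &v[0] is only valid for a non-empty vector

    std::vector<Complex> host(m * k);  // element storage is on the heap, not the stack
    std::vector<Complex> copy(m * k);

    for (std::size_t i = 0; i < host.size(); ++i) {
        host[i].re = static_cast<float>(i);
        host[i].im = 0.0f;
    }

    // &host[0] points at the first element of the contiguous buffer.
    copy_raw(&copy[0], &host[0], host.size());

    return copy.back();
}   // both vectors release their memory here, even if an exception was thrown
```

Calling round_trip(1024, 1024) allocates about 8 MB per buffer, which would typically overflow the stack as a local array but is fine for a vector.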