cuBLAS expects input and output matrices to be allocated on the device. So in your case you should create device copies of A and A_Copy using cudaMalloc, and pass them to the function cublasSgeam.
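The scalars alpha and beta can live in either host or device memory, depending on the pointer mode of the cuBLAS handle. With the cublas_v2.h API the default mode is host pointers, so passing the addresses of host variables normally works as-is. If the handle may have been switched to device pointer mode, or you simply want to be explicit, set the pointer mode before calling cublasSgeam: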
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
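For contrast, here is a minimal sketch of the other option, where alpha and beta are kept in device memory and the handle is put into CUBLAS_POINTER_MODE_DEVICE for the call. The helper name geamDeviceScalars and its parameter layout are made up for illustration; it assumes m x n column-major matrices that already live on the device, and error handling is kept to a minimum.

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper (not part of cuBLAS): computes C = alpha*A + beta*B
// for m x n column-major device matrices, passing alpha and beta to
// cublasSgeam through device pointers.
cublasStatus_t geamDeviceScalars(cublasHandle_t handle, int m, int n,
                                 float alpha, const float *A_dev,
                                 float beta, const float *B_dev,
                                 float *C_dev)
{
    // Stage both scalars in one small device buffer.
    const float scalars[2] = {alpha, beta};
    float *scalars_dev = nullptr;
    if (cudaMalloc(&scalars_dev, sizeof(scalars)) != cudaSuccess)
        return CUBLAS_STATUS_ALLOC_FAILED;
    cudaMemcpy(scalars_dev, scalars, sizeof(scalars), cudaMemcpyHostToDevice);

    // Remember the current pointer mode so it can be restored afterwards.
    cublasPointerMode_t saved;
    cublasGetPointerMode(handle, &saved);
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    cublasStatus_t status = cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                        m, n,
                                        scalars_dev,     A_dev, m,
                                        scalars_dev + 1, B_dev, m,
                                        C_dev, m);

    cublasSetPointerMode(handle, saved);

    // Make sure the kernel has finished reading the scalars before freeing.
    cudaDeviceSynchronize();
    cudaFree(scalars_dev);
    return status;
}
```

With host pointer mode none of the staging is needed: you declare alpha and beta as ordinary float variables and pass &alpha and &beta directly, which is usually the simpler choice.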
Update:
You are getting zeros because you initialize A_Copy with zeros and copy it to A_Copy_dev, which is then used as the A input matrix of the cuBLAS function. So you pass in zeros and get zeros back.
In the second cudaMemcpy call, you should copy A, not A_Copy, to A_Copy_dev, like this:
cudaMemcpy(A_Copy_dev, A, ARRAY_BYTES, cudaMemcpyHostToDevice);
There is no need for the host array A_Copy in this code.
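Putting the pieces together, the whole flow might look like the sketch below. It assumes the goal is an out-of-place transpose of a rows x cols column-major matrix, which is a guess at what the original code does with cublasSgeam; the helper name transposeWithGeam and the buffer names A_dev and At_dev are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: transposes the rows x cols column-major host matrix A
// into the cols x rows column-major host matrix At. Returns true on success.
bool transposeWithGeam(const float *A, float *At, int rows, int cols)
{
    const size_t bytes = sizeof(float) * rows * cols;
    float *A_dev = nullptr;   // device copy of the input
    float *At_dev = nullptr;  // device buffer for the result
    cublasHandle_t handle = nullptr;
    bool ok = false;

    if (cudaMalloc(&A_dev, bytes) == cudaSuccess &&
        cudaMalloc(&At_dev, bytes) == cudaSuccess &&
        cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS &&
        // Copy the real input data (A) to the device, not a zeroed buffer.
        cudaMemcpy(A_dev, A, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        // Defensive: give the output buffer defined contents, since it is
        // also passed as the B operand below.
        cudaMemset(At_dev, 0, bytes) == cudaSuccess)
    {
        // alpha and beta are host variables, so use host pointer mode.
        const float alpha = 1.0f;
        const float beta = 0.0f;
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);

        // At = alpha * A^T + beta * At. The result is cols x rows, so
        // m = cols, n = rows, lda = rows, ldb = ldc = cols.
        cublasStatus_t status = cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                                            cols, rows,
                                            &alpha, A_dev, rows,
                                            &beta, At_dev, cols,
                                            At_dev, cols);

        ok = status == CUBLAS_STATUS_SUCCESS &&
             cudaMemcpy(At, At_dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    if (handle) cublasDestroy(handle);
    cudaFree(A_dev);
    cudaFree(At_dev);
    return ok;
}
```

Note that only two device buffers are involved and the host side needs just the input A and a place to receive the result; no zero-initialized host copy is uploaded anywhere.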