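Recall that CUBLAS uses the column-major storage convention, so the leading dimension of a matrix that is not part of some larger matrix is simply its number of rows. Under that assumption, the leading dimension of cA is M, the leading dimension of cB is K, and the leading dimension of cAout is M, which corresponds to cA being M x K, cB being K x N, and cAout being M x N. Note also that cublasSgemm takes its dimension arguments in the order m, n, k (rows of the result, columns of the result, shared inner dimension), so the inner dimension K goes last. Your SGEMM call should therefore read
HANDLE_ERROR(cublasSgemm(hdl, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K, &alpha, cA, M, cB, K, &beta, cAout, M));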
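
If you want to check the role of each leading dimension without a GPU, the column-major convention can be mimicked on the CPU. The following is a minimal reference sketch, not CUBLAS code: ref_sgemm_nn and the CM macro are hypothetical names introduced here, and the argument order (m, n, k, then each matrix followed by its leading dimension) is chosen to mirror cublasSgemm with CUBLAS_OP_N for both operands.

```c
#include <assert.h>
#include <stddef.h>

/* Element (row, col) of a column-major matrix with leading dimension ld. */
#define CM(ptr, ld, row, col) ((ptr)[(size_t)(col) * (size_t)(ld) + (size_t)(row)])

/* Hypothetical CPU reference: C = alpha * A * B + beta * C, no transposes.
   A is m x k (lda >= m), B is k x n (ldb >= k), C is m x n (ldc >= m). */
void ref_sgemm_nn(int m, int n, int k, float alpha,
                  const float *A, int lda,
                  const float *B, int ldb,
                  float beta, float *C, int ldc)
{
    /* Each leading dimension must be at least the row count of its matrix. */
    assert(lda >= m && ldb >= k && ldc >= m);

    for (int j = 0; j < n; ++j) {          /* columns of C */
        for (int i = 0; i < m; ++i) {      /* rows of C */
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)    /* shared inner dimension */
                acc += CM(A, lda, i, p) * CM(B, ldb, p, j);

            /* As in BLAS, C is not read when beta is zero. */
            float prev = (beta == 0.0f) ? 0.0f : beta * CM(C, ldc, i, j);
            CM(C, ldc, i, j) = alpha * acc + prev;
        }
    }
}
```

For a 2 x 3 matrix A and a 3 x 2 matrix B stored contiguously, the call would be ref_sgemm_nn(2, 2, 3, 1.0f, A, 2, B, 3, 0.0f, C, 2): each leading dimension equals the number of rows of its matrix, and the inner dimension is passed last.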