Why not allocate a regular device array and pass that to CUBLAS?
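A minimal sketch of that suggestion, assuming the cuBLAS v2 API (cublas_v2.h) rather than whatever interface the question used; the array size and contents are made up for illustration:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1024;                      // hypothetical problem size
    std::vector<float> h_x(n, 1.0f);
    h_x[41] = -7.0f;                         // element with the largest magnitude

    // Ordinary device (global) memory: the pointer is a valid device address.
    float *d_x = nullptr;
    if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return 1;

    // With the default (host) pointer mode the result is written to host memory.
    // cuBLAS reports the index using 1-based (Fortran-style) numbering.
    int idx = 0;
    cublasStatus_t st = cublasIsamax(handle, n, d_x, 1, &idx);
    if (st == CUBLAS_STATUS_SUCCESS)
        printf("1-based index of max |x|: %d\n", idx);

    cublasDestroy(handle);
    cudaFree(d_x);
    return 0;
}
```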
A __constant__ array is not a normal __device__ array. In your code you are taking the address of the array in host code and passing it to a host function. That host-side address is not a valid device address, and an address taken in device code is likewise not valid on the host. From the CUDA Programming Guide:
The address obtained by taking the address of a __device__, __shared__ or __constant__ variable can only be used in device code. The address of a __device__ or __constant__ variable obtained through cudaGetSymbolAddress() as described in Device Memory can only be used in host code.
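As a rough illustration of the second sentence, the sketch below asks the runtime for the device address of a __constant__ symbol instead of applying & to it in host code. The symbol name c_data and its size are hypothetical; the address is only handed back to runtime API calls here, not passed to a library.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float c_data[256];              // hypothetical constant array

int main() {
    float h_in[256], h_out[256];
    for (int i = 0; i < 256; ++i) h_in[i] = static_cast<float>(i);

    // Fill constant memory by symbol, not by a host-side address.
    cudaMemcpyToSymbol(c_data, h_in, sizeof(h_in));

    // &c_data in host code would be the address of the host shadow variable.
    // cudaGetSymbolAddress() returns the corresponding device address.
    void *d_ptr = nullptr;
    cudaError_t err = cudaGetSymbolAddress(&d_ptr, c_data);
    if (err != cudaSuccess) {
        printf("cudaGetSymbolAddress failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The returned pointer can be used with host-side runtime API calls.
    cudaMemcpy(h_out, d_ptr, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("round trip ok: %s\n", h_out[255] == h_in[255] ? "yes" : "no");
    return 0;
}
```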
As for accessing __constant__ memory via a device pointer, see this answer for why it will be uncached.
Finally, even if the data were served from the constant cache, using __constant__ memory in this way is inefficient because of the access pattern. The constant cache is optimized for uniform access, where all threads in a warp read the same address. isamax is likely to read a different memory location in every thread, so the accesses will be serialized. In the worst case, with all 32 threads of a warp reading distinct addresses, this is up to 32x slower than uniform access (and likely much slower than regular device memory).
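To make the two access patterns concrete, here is a small sketch with hypothetical kernels; it does not reproduce what isamax does internally, it only contrasts a warp-uniform read with a per-thread read of the same __constant__ array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float c_table[32];              // hypothetical lookup table

// Uniform: every thread in a warp reads the same element, which the
// constant cache can serve as a single broadcast.
__global__ void uniform_read(float *out, int k) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = c_table[k];
}

// Divergent: each thread in a warp reads a different element, so the
// constant-cache requests are handled one address at a time.
__global__ void divergent_read(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = c_table[threadIdx.x % 32];
}

int main() {
    float h_table[32];
    for (int i = 0; i < 32; ++i) h_table[i] = static_cast<float>(i);
    cudaMemcpyToSymbol(c_table, h_table, sizeof(h_table));

    const int n = 256;
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return 1;

    uniform_read<<<n / 128, 128>>>(d_out, 5);
    divergent_read<<<n / 128, 128>>>(d_out);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernels finished: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

Timing the two kernels over many launches with a profiler would be one way to check how large the gap is on a given GPU.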