There are several problems here. It is probably easier to start by showing the "correct" way to use those two constant arrays, then explain why what you did doesn't work. So the kernel should look like this:
__global__ void kernel(int *X, int *out, int N)
{
int tid = threadIdx.x + blockIdx.x*blockDim.x;
if( tid<N )
{
out[tid] = A[tid%3000]*X[tid] + B[tid%3000];
}
}
ie. don't try passing A and B to the kernel. The reasons are as follows:
- Somewhat confusingly,
A
andB
in host code are not valid device memory addresses. They are host symbols which provide hooks into a runtime device symbol lookup. It is illegal to pass them to a kernel- If you want their device memory address, you must usecudaGetSymbolAddress
to retrieve it at runtime. - Even if you did call
cudaGetSymbolAddress
and retrieve the symbols device addresses in constant memory, you shouldn't pass them to a kernel as an argument, because doing do would not yield uniform memory access in the running kernel. Correct use of constant memory requires the compiler to emit special PTX instructions, and the compiler will only do that when it knows that a particular global memory location is in constant memory. If you pass a constant memory address by value as an argument, the __constant__ property is lost and the compiler can't know to produce the correct load instructions
Once you get this working, you will find it is terribly slow and if you profile it you will find that there is very high degrees of instruction replay and serialization. The whole idea of using constant memory is that you can exploit a constant cache broadcast mechanism in cases when every thread in a warp accesses the same value in constant memory. Your example is the complete opposite of that - every thread is accessing a different value. Regular global memory will be faster in such a use case. Also be aware that the performance of the modulo operator on current GPUs is poor, and you should avoid it wherever possible.