Question

I am writing a program that involves some matrix-vector multiplications and least-squares solves, all using the CUBLAS and CULA APIs. The program iterates many times, and in each step I must set one particular row of a matrix to zero.

I tried copying the entire matrix (50*1000 or larger) to the CPU, setting one row to zero, and copying the matrix back, but that is too time-consuming because the program iterates 10 times or more. So I decided to write a kernel function.

The __global__ function looks like this:

__global__ void Setzero(float* A, int index) /* A is the matrix, stored in column-major order; index is the row I want to set to zero */
{
    int ind=blockDim.x*blockIdx.x+threadIdx.x;
    if( ((ind%N)==index ) && (ind<50000) )  //notice matrix is in col-major ,matrix size is 50000
    {   
    A[ind]=0.0;
        ind+=blockDim.x*blockIdx.x;
    }
    else    ;
        __syncthreads();   
}

The problem is that when I do this (using CUBLAS before calling the function):

cudaMalloc((void**)&A_Gpu_trans,sizeof(float)*50000);
cudaMemcpy(A_Gpu_trans,A_trans,sizeof(float)*M*N,cudaMemcpyHostToDevice);
cublasSgemv_v2(handle,CUBLAS_OP_N,1000,50,&al,A_Gpu_trans,1000,err_gpu,1,&beta,product,1);
dim3 dimBlock(16,1);
dim3 dimGrid((50000-1)/16+1,1);
Setzero<<<dimGrid,dimBlock>>>(A_Gpu_trans,Index);

it returns the error:

a __host__ function("Setzero") redeclared with __global__.

and another error:

MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2010 -ccbin "D:\Program Files\Microsoft Visual Studio 10.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include" -G --keep-dir Debug -maxrregcount=0 --machine 32 --compile -cudart static -g -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "C:\Users\Administrator\documents\visual studio 2010\Projects\OOmp\OOmp\kernel.cu"" returned 2.

The strange thing is that when I use only the CUBLAS and CULA APIs, I get the right answer.


Solution 2

Although you have not shown it in your question, you clearly have another host function called Setzero somewhere in your code. The simple solution is to rename the kernel to something else.

The underlying reason why the CUDA toolchain emits the error is because the Setzero<<< >>> kernel invocation syntax in the runtime API causes the CUDA front end to create a host function of the same name as the kernel with a matching argument list and substitute the kernel launch for a call to that function. This host function contains the necessary API calls to launch the kernel. By having another host function with the same name as the kernel, you defeat this process and cause the compilation error you see.
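As a minimal sketch of the collision (the helper and kernel names below are hypothetical):

```
// Somewhere else in the code, a host function shares the kernel's name...
void Setzero(float* A, int index);            // host helper -- collides with the
                                              // compiler-generated host stub for the kernel

// ...so renaming the kernel resolves the conflict:
__global__ void SetzeroKernel(float* A, int index) { /* ... */ }

// and the launch uses the new name:
// SetzeroKernel<<<dimGrid, dimBlock>>>(A_Gpu_trans, Index);
```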

Other tips

Also, your function is both wrong and wildly inefficient...

You can't have a __syncthreads() call inside a conditional like this; it can lead to a hang. It also appears to be entirely unnecessary here.

More to the point, you are launching one thread for every matrix entry, and only 1/N of them actually do anything.

A better approach is to launch only threads corresponding to entries which will be set to zero. Something like this:

__global__ void Setzero(float* A, int index)
{
    int ind = blockDim.x * blockIdx.x + threadIdx.x;  // one thread per column
    if (ind < M)                                      // M = number of columns
        A[index + N * ind] = 0.0f;                    // N = leading dimension (rows), col-major
}

and you launch M threads (or rather, ceil(M/256) thread blocks of 256 threads each, or whatever block size you want).

E.g.:

int block_size = 256; // usually a good choice
int num_blocks = (M + block_size - 1) / block_size;
Setzero<<<num_blocks, block_size>>>(A, index);
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow