Cuda cannot run more than 1024*255 Threads [closed]

https://stackoverflow.com/questions/23505807

16-07-2023
|

Question

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.

This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.

Closed 9 years ago.

Improve this question

I am running on Amazon's K520 GPU with 1500 cores and 4GB RAM. I am trying the run a kernel with 1024*850 threads. I know I can only get up to 1024 threads per block, but it surprised me when I could not launch more than 255 blocks using 1024 threads per block (I get a launch error). I thought the limit was 2^16 for grid sizes. When I ran an empty kernel though, it goes through it fine. That makes me think there is not enough memory somewhere. I wonder if I could get an explanation as to what is happening. Thanks. Here is the kernel:

__global__ void dotSubCentroidNorm
(
 Pt* segments,
 int pointCount,
 const Pt* centroids,
 const int* segmentChanges,
 float *dotResult
 )
{

  int idx = index();
  if(idx>=pointCount)
    return;
  int segment = segments[idx].segmentIndex;
  if(segment<0)
    return;
  int segPtCount = segmentChanges[segment+1]-segmentChanges[segment];
  Pt &pt = segments[idx];
  if(segPtCount==0)
  {
    printf("segment pt count =0 %d %d\n",idx, segment);
    return;
  }
  const Pt &ctr = centroids[segment];
  pt.x=pt.x-ctr.x/segPtCount;
  pt.y=pt.y-ctr.y/segPtCount;
  pt.z=pt.z-ctr.z/segPtCount;

  dotResult[idx] = pt.x*pt.x;
  dotResult[pointCount + idx] = pt.x*pt.y;
  dotResult[pointCount*2 + idx] = pt.x*pt.z;
  dotResult[pointCount*3 + idx] = pt.y*pt.y;
  dotResult[pointCount*4 + idx] = pt.y*pt.z;
  dotResult[pointCount*5 + idx] = pt.z*pt.z;
}

and the struct:

struct Pt
{
  float x,y,z;
  int segmentIndex;
};

I am calling this kernel with an array of about 400,000 Pt's for segments, and 200 Pt's for centroids, and 200 for segmentChanges, and 400,000*6 for dotResult. Here is the call:

....
thrust::device_vector<float> dotResult(pointCount*6);

printf("Errors1: %s \n",cudaGetErrorString(cudaGetLastError()));

int tpb = 1024; //threads per block
dim3 blocks = blkCnt(pointCount, tpb);
printf("blocks: %d %d\n", blocks.x, blocks.y);
dotSubCentroidNorm<<<blocks ,tpb>>>
  (
   segments,
   pointCount,
   thrust::raw_pointer_cast(centroids.data()),
   segmentChanges,
   thrust::raw_pointer_cast(dotResult.data())
  );
printf("Errors2: %s \n",cudaGetErrorString(cudaGetLastError()));
cudaThreadSynchronize();

printf("Errors3: %s \n",cudaGetErrorString(cudaGetLastError()));
....

 #define blkCnt(size, threadsPerBlock) dim3(min(255,(int)floor(1+(size)/(threadsPerBlock))),floor(1+(size)/(threadsPerBlock)/256))
#define index() (threadIdx.x + (((gridDim.x * blockIdx.y) + blockIdx.x)*blockDim.x))
....

Solution

So apparently I was passing in a host array for "segmentChanges" instead of a device, which is why it was crashing.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow