There are at least two gigantic, basic problems in this code, neither of which has anything to do with CUDA:
histSize = sizeof(unsigned int) * xMax/cellWidth * yMax/cellHeight * numColors;
//....
h = (unsigned int*) malloc(histSize);
//.....
for(i=0; i<histSize; i++)
h[i]=0; // <-- buffer overflow: histSize is a byte count, but it is used here as an element count
which is probably killing the program before it ever even gets to launch the kernel, and:
cudaMalloc( (void**) &dev_h, histSize );
// .......
cudaMemcpy(dev_h, h, size, cudaMemcpyHostToDevice); // buffer overflow: copies size bytes into a histSize-byte allocation
which would kill the CUDA context if the program ever got that far.
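One way to avoid both overflows is to keep the element count and the byte count in separate variables. The following is a minimal sketch, not the original program: the helper names `hist_bin_count`, `hist_byte_size` and `hist_alloc_zeroed` are hypothetical, and it assumes the intended number of bins is (xMax/cellWidth) * (yMax/cellHeight) * numColors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Number of histogram bins (elements, NOT bytes).
   Assumption: the grid has xMax/cellWidth by yMax/cellHeight cells,
   each with numColors bins. The parentheses make that explicit. */
static size_t hist_bin_count(size_t xMax, size_t yMax,
                             size_t cellWidth, size_t cellHeight,
                             size_t numColors)
{
    assert(cellWidth > 0 && cellHeight > 0);
    return (xMax / cellWidth) * (yMax / cellHeight) * numColors;
}

/* Size of the histogram in bytes. This is the value that allocation
   and copy calls taking a byte count (malloc, cudaMalloc, cudaMemcpy)
   should receive. */
static size_t hist_byte_size(size_t numBins)
{
    return numBins * sizeof(unsigned int);
}

/* Allocate a zero-initialised host histogram; returns NULL on failure.
   calloc zeroes the memory, so no hand-written clearing loop is needed. */
static unsigned int *hist_alloc_zeroed(size_t numBins)
{
    return (unsigned int *) calloc(numBins, sizeof(unsigned int));
}
```

With this split, any loop over the histogram runs to `numBins`, and the device allocation and the host-to-device copy would both be given `hist_byte_size(numBins)` rather than an unrelated `size`.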
These are elementary mistakes and you haven't detected them because your only use case is apparently a program which attempts to process a 150 MB input file and emit a large histogram from it, and your only method of detecting errors is looking at a file containing that histogram. That is a completely insane way to develop and debug code. If you had done any of the following:
- Hardcoded a trivially small test case you already knew the answers for
- Added CUDA API error checking
- Run valgrind
- Used cuda-memcheck
- Used a host debugger
- Run nvprof
you probably would have instantly detected the problems (there might well be more, but I don't care enough to look for them; that is your job), and this Stack Overflow question wouldn't exist.
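As an illustration of the first item in that list, a hardcoded test case can be as small as four points on a 4x4 grid. This is only a sketch under stated assumptions: the input format in the question is not shown, so the `Point` struct, the bin layout and the function names `hist_reference` and `hist_small_case_ok` are all hypothetical. The same expected array could be compared against the output copied back from the GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical input record; the real format in the question is unknown. */
typedef struct { unsigned int x, y, color; } Point;

/* CPU reference histogram: counts points per (cell, color) bin.
   Assumed layout: bin = (cellY * cellsX + cellX) * numColors + color. */
static void hist_reference(const Point *pts, size_t n,
                           unsigned int cellsX,
                           unsigned int cellWidth, unsigned int cellHeight,
                           unsigned int numColors, unsigned int *h)
{
    for (size_t i = 0; i < n; i++) {
        unsigned int cx = pts[i].x / cellWidth;
        unsigned int cy = pts[i].y / cellHeight;
        assert(cx < cellsX && pts[i].color < numColors);
        h[(cy * cellsX + cx) * numColors + pts[i].color]++;
    }
}

/* Returns 1 if a 4x4 grid with 2x2 cells and 2 colors (8 bins)
   produces the hand-computed histogram, 0 otherwise. */
static int hist_small_case_ok(void)
{
    const Point pts[4] = { {0, 0, 0}, {1, 1, 0}, {3, 0, 1}, {2, 3, 1} };
    /* Worked out by hand: two color-0 points in cell (0,0),
       one color-1 point in cell (1,0), one color-1 point in cell (1,1). */
    const unsigned int expected[8] = { 2, 0, 0, 1, 0, 0, 0, 1 };
    unsigned int h[8] = { 0 };

    hist_reference(pts, 4, 2, 2, 2, 2, h);
    return memcmp(h, expected, sizeof expected) == 0;
}
```

A case this small can be checked by eye, and running it under valgrind or cuda-memcheck takes seconds rather than a pass over a 150 MB file.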