Question

I have a Monte Carlo step in CUDA that I need help with. I already wrote the serial code, and it works as expected. Let's say I have 256 particles, which are stored in

vector< vector<double> > *r;

Each entry i in r has an (x,y) component, both of which are doubles. Here, r[i] is the position of particle i.

Now, in CUDA, I'm supposed to set up this vector on the host and send it to the device. Once on the device, these particles need to interact with each other. Each thread is supposed to run a Monte Carlo sweep. How do I allocate memory, reference/dereference pointers with cudaMalloc, and decide which functions to make global/shared? I just can't wrap my head around it.

Here's what my memory allocation looks like at the moment:

cudaMalloc((void**)&r, (blocks*threads)*sizeof(double));    
CUDAErrorCheck();
kernel <<<blocks, threads>>> (&r, randomnums);
cudaDeviceSynchronize();
CUDAErrorCheck();
cudaMemcpy(r, blocks*threads*sizeof(double), cudaMemcpyDeviceToHost);

The above code is at potato level. I guess I'm not sure what to do, even conceptually. My main problem is allocating memory and passing information to and from the device and host. The vector r needs to be allocated, copied from host to device, operated on in the device, and copied back to the host. Any help/"pointers" will be much appreciated.


Solution

Your "potato level" code demonstrates a general lack of understanding of CUDA, including but not limited to the management of the r data. I would suggest that you increase your knowledge of CUDA by taking advantage of some of the educational resources available, and then develop an understanding of at least one basic CUDA code, such as the vector add sample. You will then be much better able to frame questions and understand the responses you receive. An example:

This would almost never make sense:

    cudaMalloc((void**)&r, (blocks*threads)*sizeof(double));    
    CUDAErrorCheck();
    kernel <<<blocks, threads>>> (&r, randomnums);

You either don't know the very basic concept that data must be transferred to the device (via cudaMemcpy) before it can be used by a GPU kernel, or you can't be bothered to write "potato level" code that makes any sense at all, which would suggest a lack of effort in writing a sensible question. Also, regardless of what r is, passing &r to a CUDA kernel would almost never make sense.
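
For reference, the canonical host-side pattern is: allocate on the device, copy the data in, launch the kernel, then copy the results back. Here is a minimal sketch of that ordering (the names data_h, data_d, and n are illustrative, as is the kernel's single-pointer signature; they are not from your code):

    double *data_h, *data_d;
    data_h = (double *)malloc(n*sizeof(double));   // allocate and fill on the host
    cudaMalloc(&data_d, n*sizeof(double));         // allocate on the device
    cudaMemcpy(data_d, data_h, n*sizeof(double), cudaMemcpyHostToDevice);  // copy in BEFORE launching
    kernel<<<blocks, threads>>>(data_d);           // pass the device pointer itself, not its address
    cudaMemcpy(data_h, data_d, n*sizeof(double), cudaMemcpyDeviceToHost);  // copy results back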

Regarding your question about how to move r back and forth:

  1. The first step in solving your problem will be to recast the r position data as something that is easily usable by a GPU kernel. In general, vector is not that useful for ordinary CUDA code, vector< vector< > > even less so, and a pointer to one (*r) less so still. Therefore, flatten (copy) your position data into one or two dynamically allocated 1-D arrays of double:

    #define N 1000
    ...
    vector< vector<double> > r(N);
    ...
    double *pos_x_h, *pos_y_h, *pos_x_d, *pos_y_d;
    // flat host copies of the x and y coordinates
    pos_x_h = (double *)malloc(N*sizeof(double));
    pos_y_h = (double *)malloc(N*sizeof(double));
    // flatten: copy each particle's (x,y) pair into the 1-D arrays
    for (int i = 0; i < N; i++){
      pos_x_h[i] = r[i][0];
      pos_y_h[i] = r[i][1];
    }
    
  2. Now you can allocate space for the data on the device and copy the data to the device:

    cudaMalloc(&pos_x_d, N*sizeof(double));
    cudaMalloc(&pos_y_d, N*sizeof(double));
    cudaMemcpy(pos_x_d, pos_x_h, N*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(pos_y_d, pos_y_h, N*sizeof(double), cudaMemcpyHostToDevice);
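    // (Your CUDAErrorCheck() isn't shown; note that every CUDA runtime call
    // above returns a cudaError_t. One minimal way to check, as a sketch:)
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
      printf("CUDA error: %s\n", cudaGetErrorString(err));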
    
  3. Now you can properly pass the position data to your kernel:

    kernel<<<blocks, threads>>>(pos_x_d, pos_y_d, ...);
    
  4. Copying the data back after the kernel will be approximately the reverse of the above steps. This will get you started:

    cudaMemcpy(pos_x_h, pos_x_d, N*sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(pos_y_h, pos_y_d, N*sizeof(double), cudaMemcpyDeviceToHost);
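    // The rest of the "reverse": unflatten the results back into r and
    // release the buffers (this assumes each r[i] still has its two elements)
    for (int i = 0; i < N; i++){
      r[i][0] = pos_x_h[i];
      r[i][1] = pos_y_h[i];
    }
    cudaFree(pos_x_d); cudaFree(pos_y_d);
    free(pos_x_h); free(pos_y_h);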
    

There are many ways to skin the cat, of course; the above is just an example. However, the above data organization will be well suited to a kernel/thread strategy that assigns one thread to process one (x,y) position pair.
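
To make that strategy concrete, here is a skeleton of such a kernel. It is only a sketch: the kernel name matches the launch shown in step 3, but the int n parameter and the Monte Carlo body are placeholders you would fill in:

    __global__ void kernel(double *pos_x, double *pos_y, int n){
      int i = blockIdx.x*blockDim.x + threadIdx.x;  // one thread per (x,y) pair
      if (i < n){
        // propose and accept/reject a Monte Carlo move for particle i here,
        // reading and writing pos_x[i] and pos_y[i]
      }
    }

Launched as kernel<<<blocks, threads>>>(pos_x_d, pos_y_d, N) with blocks*threads >= N, the if (i < n) guard simply retires any excess threads.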

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow