Question

I'm a graduate student in biophysics, trying to program a protein aggregation model using PyCUDA and SciPy's odeint. Within the past two weeks I've gotten the code running, but it's very slow. Let me see if I can explain what my code does.

I have a NumPy array of N concentrations, where element i is the concentration of the polymer of length i+1. I have a function that calculates the rate of change of the polymer concentrations using CUDA, where each thread calculates the rate of change of one specific polymer length. During this calculation, an array of length (N-i-1) needs to be summed by the thread, drastically slowing down my code.

Doing a little reading and Googling, I've come across parallel reduction as a way of using parallelism to make a serial calculation like an array sum go much faster. Of course, I'm referring to Mark Harris's slides on parallel reduction. These were a great read, and this looks like a potential way to drastically speed up my code, but I have a few questions:

If the number of polymer species, N, needs to be ~8700-9000, is it conceivable to use CUDA to reduce these N arrays at the same time? Doing a quick calculation (again possible thanks to SO's great explanation of how to calculate the maximum number of concurrent threads), I get for my GTX Titan that I can have 15 * 64 * 32 = 30720 threads running at once. If I invoke my kernel with ~8960 threads at a time, I should only have 21760 threads left to use, correct? Since it seems that you need at least (array length / 2) threads to properly reduce an array, I'm doomed.

I was thinking that perhaps I could divide up the remaining threads and use them to reduce a few of the big arrays at a time, in sequence.

I don't know...I'm just a physics grad student. I thought I'd ask the professionals before I embarked on a long journey in the wrong direction. Is it possible to easily and efficiently tell a kernel to reduce something?

Thank you, Karsten

Here's a representation of what I'm trying to do.

fluxes and concs are np.arrays
dcdt(concs, t)
    Call CUDA to calculate fluxes
        Thread
        0       fluxes[0] = stuff + sum(concs[n] for n from 1 to 9000)
        1       fluxes[1] = stuff + sum(concs[n] for n from 2 to 9000)
        2       fluxes[2] = stuff + sum(concs[n] for n from 3 to 9000)
        ...
        N       fluxes[N] = stuff

You'll notice that the arrays being summed are just successively shorter suffixes of the same concentration array, one per thread. This makes me wonder if this is something I should just do on the host.


Solution

It's conceivable to use CUDA to reduce multiple arrays "in parallel". Reduction (summation) isn't a terribly compute-intensive operation, so if the data is not already resident on the GPU, the cost of transferring it there is likely to be a significant part (probably the majority) of the overall execution time. From your description it's not clear whether the calculation is currently happening on the GPU or on the CPU, but if the data is on the GPU, then summing via parallel reduction will be fastest.

Unless a single array is larger than ~2 GB, the number of threads is not likely to be an issue.

You could craft a kernel which simply reduces the arrays one after the other, in sequence. You say there are N arrays, where N is around 9000, but how big is each array? If the arrays are large enough, approximately all of the power of the GPU can (roughly speaking) be brought to bear on each individual operation, and in that case there's no significant penalty to reducing the arrays one after the other. The kernel could then be a basic parallel reduction that loops over the arrays. It should be pretty straightforward; a sketch follows.
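Here's a minimal sketch in that spirit, simplified (my choice) to a host-side loop that launches a basic shared-memory reduction once per array. The layout assumptions are mine: the arrays sit back to back in device memory, located by hypothetical offset[]/len[] tables, and the helper names reduce_sum/sum_arrays are not from any library:

    #include <cuda_runtime.h>

    // Classic shared-memory tree reduction: each block writes one partial sum.
    __global__ void reduce_sum(const float *in, float *out, int n)
    {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * (blockDim.x * 2) + tid;

        // Each thread loads up to two elements, guarding the array bound.
        float v = 0.0f;
        if (i < n)              v += in[i];
        if (i + blockDim.x < n) v += in[i + blockDim.x];
        sdata[tid] = v;
        __syncthreads();

        // Fold the block's values pairwise down to sdata[0].
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }

    // Reduce num_arrays arrays in sequence; array k starts at d_data + offset[k]
    // and has len[k] elements (hypothetical layout; assumes len[k] > 0).
    void sum_arrays(const float *d_data, const int *offset, const int *len,
                    float *d_sums, int num_arrays)
    {
        const int threads = 256;
        float *d_partial;
        cudaMalloc(&d_partial, 1024 * sizeof(float)); // plenty for ~9000-element arrays

        for (int k = 0; k < num_arrays; ++k) {
            int blocks = (len[k] + threads * 2 - 1) / (threads * 2);
            size_t smem = threads * sizeof(float);
            // Pass 1: one partial sum per block.
            reduce_sum<<<blocks, threads, smem>>>(d_data + offset[k], d_partial, len[k]);
            // Pass 2: blocks is small here (~18 for 9000 elements), so a
            // single block can fold the partials into d_sums[k].
            reduce_sum<<<1, threads, smem>>>(d_partial, d_sums + k, blocks);
        }
        cudaFree(d_partial);
    }

With ~9000-element arrays the first pass only produces around 18 partials, which is why the single-block second pass is enough.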

If you have roughly 9000 arrays to crunch, and it's not difficult to arrange your data in an interleaved fashion, then you might also consider a grid of ~9000 threads, where each thread sums the elements of a single array in a serial loop, pretty much the way you'd do it naively in CPU code. Data organization would be critical here, because the goal of all of this is to maximize utilization of the available memory bandwidth. As the loop in each thread picks up its next data element to be summed, you want contiguous data reads amongst the threads in a warp (coalesced access), which implies an interleaved storage arrangement amongst your N arrays. If that were the case, this approach would run quite fast as well; see the sketch below.
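A minimal sketch of that idea, assuming (my assumptions) that the arrays are interleaved so element j of array k sits at data[j * num_arrays + k], with shorter arrays zero-padded out to max_len:

    // One thread per array: thread k serially sums array k. With the
    // interleaved layout, the addresses read by a warp at each step j are
    // contiguous in memory, so the loads coalesce.
    __global__ void sum_interleaved(const float *data, float *sums,
                                    int num_arrays, int max_len)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k >= num_arrays) return;

        float total = 0.0f;
        for (int j = 0; j < max_len; ++j)
            total += data[j * num_arrays + k];  // stride of num_arrays per step
        sums[k] = total;
    }

    // Launch with enough blocks to cover all arrays, e.g.:
    //   sum_interleaved<<<(num_arrays + 255) / 256, 256>>>(d_data, d_sums,
    //                                                      num_arrays, max_len);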

By the way, you might take a look at Thrust, which is relatively easy to use and provides simple operations for doing sum-reductions on arrays. As a prototype, it would be straightforward to write a loop in Thrust code that iteratively sums a sequence of arrays on the GPU.
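Something along these lines, again assuming the back-to-back layout with hypothetical offset/len tables from the earlier sketch:

    #include <thrust/device_vector.h>
    #include <thrust/execution_policy.h>
    #include <thrust/reduce.h>
    #include <vector>

    // Prototype: one thrust::reduce call per array, executed on the device.
    void thrust_sums(const thrust::device_vector<float> &data,
                     const std::vector<int> &offset,
                     const std::vector<int> &len,
                     std::vector<float> &sums)
    {
        const float *p = thrust::raw_pointer_cast(data.data());
        for (size_t k = 0; k < len.size(); ++k)
            sums[k] = thrust::reduce(thrust::device,
                                     p + offset[k], p + offset[k] + len[k],
                                     0.0f);
    }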

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow