Question

I have a workflow that operates as follows:

  1. Load Initial Values
  2. Process Values to Intermediate Results A
  3. Process A to Intermediate Results B
  4. Process B to Intermediate Results C
  5. Process C and B to Intermediate Results D and E
  6. Sum Partial D to Final Result F

The natural structure for all of my intermediate results is that of a 2D array, which I am allocating with cudaMallocPitch().
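For context, each intermediate buffer is allocated roughly like the sketch below; the function name, element type, and dimensions are placeholders for illustration, not my actual code:

// illustrative sketch: allocate a width x height 2D array of ints;
// the driver pads each row out to `pitch` bytes
void exampleAlloc(size_t width, size_t height) {
    size_t pitch;
    int *devBuf;
    if (cudaMallocPitch((void**)&devBuf, &pitch, width * sizeof(int), height) == cudaSuccess) {
        // row r of the 2D array starts at (char*)devBuf + r * pitch
        cudaMemset2D(devBuf, pitch, 0, width * sizeof(int), height);
        cudaFree(devBuf);
    }
}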

Unfortunately, my algorithm requires that I keep D, E, C, and B in memory all at once, and D and E are, individually, 4x larger in memory than B. Because of another restriction in my processing (iterating over a graph structure in memory), the dimensions of A and B are bounded by the maximum dimensions of D and E, which in turn are determined by the memory usage of the initial values plus the memory consumption of B and C. This dependency arises because I am "paging" sections of the intermediate results to/from device memory from the host (to accommodate very large problem sets), and I can't begin step 4 until steps 1-3 have been completed for the entire problem set.

Once I have B for the entire problem set, I can delete A.

I am currently determining the maximum size of D+E with the following function:

int gpuCalculateSimulPatterns(int lines, int patterns) {
    // query free and total device memory
    size_t free_mem, total_mem;
    int allowed_patterns;
    cudaMemGetInfo(&free_mem, &total_mem);
    // subtract space for the initial values (one int per line), then divide by a
    // padded per-pattern byte estimate for D and E (ints) and B (chars)
    allowed_patterns = (free_mem - (lines * sizeof(int))) / (lines * (sizeof(int) * 2.5) + lines * sizeof(char) * 1.5);
    // round the count down to a multiple of 32 (the warp size)
    return min(patterns, allowed_patterns - (allowed_patterns % 32));
}

It "works", but only because I'm overestimating the size of D or E (their dimensions and memory usage are identical) by 25% and doubling the expected size of B. Even then, I still run into edge cases where my memory allocation fails because it's run out of memory. I want to make more efficient use of the memory on the card and maintain alignment, as my kernels make multiple reads and writes from/to global memory.

No, using shared memory is not an option, as I'm using multiple kernels across multiple blocks, and the threads within a block don't interact at all.

I've found that cudaMallocPitch() only returns the pitch it used for memory that has already been successfully allocated. Is there a way to hand the driver a 2D memory allocation request and just ask what pitch it would assign?

I'd rig up a trial-and-error optimization routine, but the linked dimensional dependency between A, B, D, and E (C I calculate a priori, as it is not allocated pitch-linear) makes this a lousy solution, and it would need to be recalculated for every problem set.

Does anyone have a better approach that would allow me to determine an appropriate size of my intermediate datasets that will fit in an arbitrary amount of device memory?

EDIT:

The memory for intermediate A is being reused. My bounding calculation assumes that C + D + E + B >> Initial + A + B (which is true because A and B are 1-byte chars of the same dimensions, while C, D, and E are ints), so I only need to ensure there's enough space for B + C + D + E.

I'm testing this only on Compute Capability 2.x cards (Quadro 2000, Tesla C2075, GTX 460).


Solution

Pitch in bytes is calculated this way:

pitch_size = DIV_UP(width_in_bytes, prop.textureAlignment);

Where DIV_UP rounds the first argument up to the next higher multiple of the second argument.

You can query prop.textureAlignment by calling cudaGetDeviceProperties().
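For example, here is a minimal sketch of predicting the pitch this way and sanity-checking it against what cudaMallocPitch() actually returns (the helper name roundUpToMultiple is just for illustration):

#include <cstdio>
#include <cuda_runtime.h>

// round x up to the next multiple of align
static size_t roundUpToMultiple(size_t x, size_t align) {
    return ((x + align - 1) / align) * align;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    size_t width_in_bytes = 1000;  // example row width
    size_t predicted = roundUpToMultiple(width_in_bytes, prop.textureAlignment);

    // sanity check against the pitch the driver actually assigns
    void *p = 0;
    size_t actual = 0;
    if (cudaMallocPitch(&p, &actual, width_in_bytes, 4) == cudaSuccess) {
        printf("textureAlignment = %zu, predicted pitch = %zu, actual pitch = %zu\n",
               prop.textureAlignment, predicted, actual);
        cudaFree(p);
    }
    return 0;
}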

OTHER TIPS

Shouldn't the pitch in bytes be:

pitch_size = DIV_UP(width_in_bytes, prop.textureAlignment)*prop.textureAlignment;

instead of:

pitch_size = DIV_UP(width_in_bytes, prop.textureAlignment);
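As a concrete check, assuming DIV_UP is the usual ceiling-division macro:

#define DIV_UP(a, b) (((a) + (b) - 1) / (b))

// e.g. width_in_bytes = 1000 and prop.textureAlignment = 512:
//   DIV_UP(1000, 512)        == 2     (a count of alignment units, not bytes)
//   DIV_UP(1000, 512) * 512  == 1024  (a pitch in bytes)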
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow