Question

What is the best pattern to get a GPU efficiently calculate 'anti-functional' routines, that usually depend on positioned memory writes instead of reads? Eg. like calculating a histogram, sorting, dividing a number by percentages, merging data of differing size into lists etc. etc.

Was it helpful?

Solution

The established terms are gather reads and scatter writes

gather reads

This means that your program will write to a fixed position (like the target fragment position of a fragment shader), but has fast access to arbitrary data sources (textures, uniforms, etc.)

scatter writes

This means, that a program receives a stream of input data which it cannot arbitarily address, but can do fast writes to arbitrary memory locations.

Clearly the shader architecture of OpenGL is a gather system. Latest OpenGL-4 also allows some scatter writes in the fragment shader, but they're slow.

So what is the most efficient way, these days, to emulate "scattering" with OpenGL. So far this is using a vertex shader operating on pixel sized points. You send in as many points as you have data-points to process and scatter them in target memory by setting their positions accordingly. You can use geometry and tesselation shaders to yield the points processed in the vertex unit. You can use texture buffers and UBOs for data input, using the vertex/point index for addressing.

OTHER TIPS

GPU's are built with multiple memory types. One type is the DDRx RAM that is accessible to the host CPU and the GPU. In OpenCL and CUDA this called 'global' memory. For GPUs data in global memory must be transferred between the GPU and Host. It's usually arranged in banks to allow for pipelined memory access. So random reads/writes to 'global' memory are comparatively slow. The best way to access 'global' memory is sequentially.
It's size ranges from 1G - 6B per device.

The next type, of memory, is a on the GPU. It's shared memory that is available to a number of threads/warps within a compute unit/multi-processor. This is faster than global memory but not directly accessible from the host. CUDA calls this shared memory. OpenCL calls this local memory. This is the best memory to use for random access to arrays. For CUDA there is 48K and OpenCL there is 32K.

The third kind of memory are the GPU registers, called private in OpenCL or local in CUDA. Private memory is the fastest but there is less available than local/shared memory.

The best strategy to optimize for random access to memory is to copy data between global and local/shared memory. So a GPU application will copy portions its global memory to local/shared, do work using local/shared and copy the results back to global.

The pattern of copy to local, process using local and copy back to global is an important pattern to understand and learn to program well on GPUs.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top