Question

In the CUDA C Programming Guide, there is a part that says:

Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).

If this size and alignment requirement is not fulfilled, the access compiles to multiple instructions with interleaved access patterns that prevent these instructions from fully coalescing. It is therefore recommended to use types that meet this requirement for data that resides in global memory.

I am using a Java wrapper (JCuda) to use CUDA in my code. I have defined my own float3 equivalent in Java (which is just a float[] array of interleaved x, y and z elements).
My question is: since the float3 I have defined occupies 3 × sizeof(float) = 12 bytes, and 12 bytes is not one of the word sizes that CUDA fetches, should I manually add a padding element at the end to make it 16 bytes?

As a closely related side question:
My kernel requires a pointer to float3 data, so when I call it from Java I pass it the float[] array that contains all the float3 elements on the Java side. Given that my Java float3's are currently not aligned, am I processing wrong values? I'm asking because another part of the programming guide says:

Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results (off by a few words), so special care must be taken to maintain alignment of the starting address of any value or array of values of these types. A typical case where this might be easily overlooked is when using some custom global memory allocation scheme, whereby the allocations of multiple arrays (with multiple calls to cudaMalloc() or cuMemAlloc()) is replaced by the allocation of a single large block of memory partitioned into multiple arrays, in which case the starting address of each array is offset from the block's starting address.

So does this mean that when my data are not aligned and I request a certain offset into that data, I am fetching wrong values?

Thanks in advance for the answers :-)


Solution

So after some trial and error, it seems that using padded float3's definitely improves the performance of the program. Thus I decided to use both padded float3's and strided memory (using cudaMallocPitch).
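
For anyone curious what that looks like on the CUDA side, here is a minimal sketch (the buffer dimensions are made up, and the padded float3 is represented by the built-in float4 for brevity):

    #include <cuda_runtime.h>

    int main() {
        // Illustrative dimensions only: width in elements, height in rows.
        const size_t width  = 1024;
        const size_t height = 768;

        float4* d_points = nullptr;
        size_t pitch = 0;  // pitch in bytes, chosen by the runtime

        // cudaMallocPitch pads each row so that every row starts at a
        // well-aligned address, which keeps row starts suitable for
        // fully coalesced, naturally aligned accesses.
        cudaMallocPitch((void**)&d_points, &pitch, width * sizeof(float4), height);

        // Element (row, col) is addressed through the pitch in bytes:
        // float4* row_ptr = (float4*)((char*)d_points + row * pitch);
        // float4  value   = row_ptr[col];

        cudaFree(d_points);
        return 0;
    }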

However, I still have not heard a good answer for the second part of my question.

Other tips

There are two aspects to this question:

  1. What are the requirements for correct memory access?
  2. How can one optimize the throughput of memory accesses?

To the first item: As the CUDA documentation points out, in order to load and store data correctly, the address of each access must be evenly divisible by the size of the access. For example, an object of type float has a size of four bytes, so it must be accessed at an address that is a multiple of four. If the alignment requirement is violated, data will be read and stored incorrectly, that is, the data becomes garbled.
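
To make the correctness requirement concrete, here is a hedged sketch of the custom-allocation pitfall the guide warns about (the block size and the offsets are made up for illustration):

    #include <cuda_runtime.h>

    int main() {
        // One large allocation partitioned into several arrays by hand.
        char* block = nullptr;
        cudaMalloc((void**)&block, 1024);

        // OK: offsets 0 and 512 are multiples of 16, so float4 elements
        // placed at these offsets are naturally aligned.
        float4* a = (float4*)(block);
        float4* b = (float4*)(block + 512);

        // NOT OK: offset 500 is not a multiple of 16. Reading 16-byte
        // words through such a pointer produces incorrect results.
        // float4* bad = (float4*)(block + 500);

        (void)a; (void)b;
        cudaFree(block);
        return 0;
    }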

For built-in non-compound types, the required alignment is equal to the size of the type; this is called "natural alignment". For user-defined compound types, such as structs, the required alignment is the alignment of the largest component type. This applies to the user-defined float3 type in the question, which has a four-byte alignment requirement, as the largest component is of type float. Programmers can increase the required alignment by using the __align__() attribute. See: How to specify alignment for global device variables in CUDA
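
As a minimal sketch (the struct names are illustrative, not taken from the question's code), the default and the raised alignment look like this:

    // Default: alignment follows the largest member (float), so this
    // struct is 12 bytes with a 4-byte alignment requirement.
    struct my_float3 {
        float x, y, z;
    };

    // Raised: __align__(16) pads the struct to 16 bytes and forces 16-byte
    // alignment, so a whole element can be moved with one 16-byte access.
    struct __align__(16) my_float3_padded {
        float x, y, z;
    };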

For built-in compound types, CUDA requires alignment that is equal to the size of the compound type. For example, objects of types int2 and float2 must be aligned on an 8-byte boundary, while objects of types float4 and double2 must be aligned to a 16-byte boundary.
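
Assuming C++11 host compilation with the CUDA headers available, this can be checked directly with static_assert (a quick sanity-check sketch, not code from the answer):

    #include <vector_types.h>  // CUDA built-in vector types (int2, float2, float4, ...)

    // The built-in compound types carry an alignment equal to their size.
    static_assert(alignof(int2)    == 8,  "int2 must be 8-byte aligned");
    static_assert(alignof(float2)  == 8,  "float2 must be 8-byte aligned");
    static_assert(alignof(float4)  == 16, "float4 must be 16-byte aligned");
    static_assert(alignof(double2) == 16, "double2 must be 16-byte aligned");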

To the second item: The GPU is able to perform aligned 4-byte, 8-byte, and 16-byte accesses, and in general, the wider each access, the higher the overall memory throughput. A vastly simplified view of the GPU hardware is that there are fixed-size queues inside the hardware that track each memory access. The wider each memory access, the larger the total number of bytes that can be queued up for transfer, which in turn improves latency tolerance and overall memory throughput.

For this reason I would suggest switching, if possible, from a custom float3 type to the built-in float4 type. The former will cause data to be loaded in chunks of four bytes, while the latter allows data to be loaded in chunks of 16 bytes.
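
As a hedged sketch of what that change looks like in a kernel (the kernel name and the scaling operation are made up; only the float4 access pattern is the point), each thread then moves a whole element with one 16-byte load and one 16-byte store:

    // Each thread loads and stores one 16-byte float4 in a single access;
    // the w component simply carries the padding.
    __global__ void scale_points(const float4* __restrict__ in,
                                 float4* __restrict__ out,
                                 float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float4 p = in[i];              // one 16-byte load
            p.x *= s; p.y *= s; p.z *= s;  // w is unused padding
            out[i] = p;                    // one 16-byte store
        }
    }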

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow