There are two aspects to this question:
- What are the requirements for correct memory access?
- How can one optimize the throughput of memory accesses?
To the first item: As the CUDA documentation points out, in order to load and store data correctly, the address of each access must be evenly divisible by the size of the access. For example, an object of type float has a size of four bytes, so it must be accessed at an address that is a multiple of four. If the alignment requirement is violated, the behavior is undefined: data may be read or stored incorrectly, that is, the data becomes garbled, and depending on the GPU the kernel may instead abort with a misaligned-address error.
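The divisibility rule can be checked directly on a pointer. The following is a minimal sketch, not anything taken from the CUDA documentation: is_aligned is a hypothetical helper name, and the buffer is an ordinary host array used only to produce addresses with known offsets.

```cuda
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): returns true if address p is
// evenly divisible by access_size, i.e. legal for an access of that width.
__host__ __device__ inline bool is_aligned(const void* p, size_t access_size) {
    return reinterpret_cast<uintptr_t>(p) % access_size == 0;
}

int main() {
    // A 16-byte aligned buffer so the offsets below are predictable.
    alignas(16) static char buf[64];

    // Offset 0 is fine for a 4-byte (float) and a 16-byte (float4) access.
    printf("offset 0, float : %d\n", is_aligned(buf, sizeof(float)));
    printf("offset 0, float4: %d\n", is_aligned(buf, sizeof(float4)));

    // Offset 4 is still fine for float, but not for float4.
    printf("offset 4, float : %d\n", is_aligned(buf + 4, sizeof(float)));
    printf("offset 4, float4: %d\n", is_aligned(buf + 4, sizeof(float4)));

    // Offset 2 is not a legal address for a float access.
    printf("offset 2, float : %d\n", is_aligned(buf + 2, sizeof(float)));
    return 0;
}
```

A check like this is mostly useful as a debugging aid, for instance before casting a byte pointer to a wider type.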
For built-in non-compound types, the required alignment is equal to the size of the type; this is called "natural alignment". For user-defined compound types, such as structs, the required alignment is that of the component type with the largest alignment requirement. This applies to the user-defined float3 type in the question, which has a four-byte alignment requirement because its largest component is of type float. Programmers can increase the required alignment by using the __align__() attribute. See: How to specify alignment for global device variables in CUDA
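As a small illustration of the struct rule and of __align__(), here is a sketch with two hypothetical types, MyFloat3 and MyFloat3Aligned, standing in for the user-defined type in the question. The static_assert lines state what I would expect the compiler to report; they are checked at compile time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical user-defined type: three floats, so the required alignment
// is that of float (4 bytes) and the size is 12 bytes.
struct MyFloat3 {
    float x, y, z;
};

// Same members, but with the alignment raised to 16 bytes via __align__().
// The compiler pads the struct so its size becomes a multiple of 16.
struct __align__(16) MyFloat3Aligned {
    float x, y, z;
};

static_assert(alignof(MyFloat3) == 4, "alignment follows the float members");
static_assert(sizeof(MyFloat3) == 12, "no padding is added");
static_assert(alignof(MyFloat3Aligned) == 16, "alignment raised by __align__");
static_assert(sizeof(MyFloat3Aligned) == 16, "padded to the new alignment");

int main() {
    printf("MyFloat3       : size %zu, alignment %zu\n",
           sizeof(MyFloat3), alignof(MyFloat3));
    printf("MyFloat3Aligned: size %zu, alignment %zu\n",
           sizeof(MyFloat3Aligned), alignof(MyFloat3Aligned));
    return 0;
}
```

Note that raising the alignment also raises the memory footprint: each element of the aligned variant occupies 16 bytes instead of 12.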
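For most built-in compound types, CUDA requires alignment that is equal to the size of the compound type. For example, objects of types int2 and float2 must be aligned on an 8-byte boundary, while objects of types float4 and double2 must be aligned to a 16-byte boundary. The three-component built-in types are an exception: CUDA's own float3, for instance, only requires the four-byte alignment of its float components.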
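These requirements can be confirmed at compile time against the vector types that cuda_runtime.h provides. The sketch below only restates the numbers given above, plus the alignment of the built-in float3 for comparison.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// 8-byte built-in vector types: alignment equals size.
static_assert(sizeof(int2) == 8 && alignof(int2) == 8, "int2");
static_assert(sizeof(float2) == 8 && alignof(float2) == 8, "float2");

// 16-byte built-in vector types: alignment equals size.
static_assert(sizeof(float4) == 16 && alignof(float4) == 16, "float4");
static_assert(sizeof(double2) == 16 && alignof(double2) == 16, "double2");

// For comparison: the built-in float3 is 12 bytes with 4-byte alignment.
static_assert(sizeof(float3) == 12 && alignof(float3) == 4, "float3");

int main() {
    printf("float2 : size %zu, alignment %zu\n", sizeof(float2), alignof(float2));
    printf("float3 : size %zu, alignment %zu\n", sizeof(float3), alignof(float3));
    printf("float4 : size %zu, alignment %zu\n", sizeof(float4), alignof(float4));
    printf("double2: size %zu, alignment %zu\n", sizeof(double2), alignof(double2));
    return 0;
}
```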
To the second item: The GPU is able to perform aligned 4-byte, 8-byte, and 16-byte accesses, and in general, the wider each access, the higher the overall memory throughput. A vastly simplified view of the GPU hardware is that there are fixed-size queues inside the hardware that track each memory access. The wider each memory access, the larger the total number of bytes that can be queued up for transfer, which in turn improves latency tolerance and overall memory throughput.
For this reason I would suggest switching, if possible, from a custom float3 type to the built-in float4 type. The former will cause data to be loaded in chunks of four bytes, while the latter allows data to be loaded in chunks of 16 bytes.
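To make the suggestion concrete, here is a minimal sketch of a kernel that stores three-component data in float4 elements. The kernel name scale_xyz, the scaling operation, and the choice to treat the w component as unused padding are all assumptions for illustration; the question's actual computation is not known. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: scales the x, y, z components of each element.
// Because the element type is float4, each thread reads and writes its
// element as one 16-byte value rather than as separate 4-byte pieces.
__global__ void scale_xyz(const float4* in, float4* out, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = in[i];   // one 16-byte load
        v.x *= s;
        v.y *= s;
        v.z *= s;           // v.w is padding here and is left untouched
        out[i] = v;         // one 16-byte store
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float4> host(n, make_float4(1.0f, 2.0f, 3.0f, 0.0f));

    // cudaMalloc returns memory aligned well beyond 16 bytes, so every
    // float4 element in these arrays sits on a 16-byte boundary.
    float4* d_in = nullptr;
    float4* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float4));
    cudaMalloc(&d_out, n * sizeof(float4));
    cudaMemcpy(d_in, host.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_xyz<<<blocks, threads>>>(d_in, d_out, 2.0f, n);

    cudaMemcpy(host.data(), d_out, n * sizeof(float4), cudaMemcpyDeviceToHost);
    printf("first element: %f %f %f\n", host[0].x, host[0].y, host[0].z);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The trade-off is memory footprint: each element takes 16 bytes instead of 12, so whether the switch pays off depends on how much of the workload is bound by memory throughput.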