CUDA alignment requirement: should I change my data structures?

Question 1

After some trial and error, I found that using padded float3s improves the performance of the program. I therefore decided to use both padded float3s and pitched memory (allocated with cudaMallocPitch).

However, I still have not heard a good answer for the second part of my question.

Question 2

There are two aspects to this question:

What are the requirements for correct memory access?
How can one optimize the throughput of memory accesses?

To the first item: As the CUDA documentation points out, for data to be loaded and stored correctly, the address of each access must be evenly divisible by the size of the access. For example, an object of type float has a size of four bytes, so it must be accessed at an address that is a multiple of four. If the alignment requirement is violated, the behavior is not defined: data may be read or stored incorrectly (that is, garbled), or the kernel may abort with a misaligned-address error.
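The divisibility rule can be checked with a few lines of code. The following is a minimal host-side C++ sketch, not CUDA API: the helper name is_aligned_for and the buffer probe_buffer are made up for illustration, and the same pointer arithmetic applies to device pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of CUDA): returns true if `ptr` satisfies
// the rule "address is evenly divisible by the size of the access".
inline bool is_aligned_for(const void* ptr, std::size_t access_size) {
    assert(access_size != 0);
    return reinterpret_cast<std::uintptr_t>(ptr) % access_size == 0;
}

// A buffer of floats whose first element sits on a 16-byte boundary.
alignas(16) static float probe_buffer[8] = {};
```

With this buffer, &probe_buffer[0] is valid for 4-, 8-, and 16-byte accesses, &probe_buffer[1] only for 4-byte accesses, and &probe_buffer[2] for 4- and 8-byte accesses but not for a 16-byte one.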

For built-in non-compound types, the required alignment equals the size of the type; this is called "natural alignment". For user-defined compound types, such as structs, the required alignment is the largest alignment required by any component type. This applies to the user-defined float3 type in the question, which has a four-byte alignment requirement because its largest component is of type float. Programmers can increase the required alignment with the __align__() attribute. See: How to specify alignment for global device variables in CUDA
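To make the struct rule concrete, here is a small host-side C++ sketch. It uses the standard alignas specifier, which plays the same role in host code as __align__() does in CUDA code. The type names Vec3 and PaddedVec3 are invented for this example, and the checks assume the usual platform where float is four bytes with four-byte alignment.

```cpp
#include <cstddef>

// A user-defined three-float struct: its alignment is that of its
// largest component, i.e. of float.
struct Vec3 { float x, y, z; };

// The same members with the required alignment raised to 16 bytes.
// In CUDA code this would be written as: struct __align__(16) PaddedVec3 {...};
struct alignas(16) PaddedVec3 { float x, y, z; };

static_assert(alignof(Vec3) == alignof(float), "struct inherits float alignment");
static_assert(sizeof(Vec3) == 3 * sizeof(float), "no padding is added");
static_assert(alignof(PaddedVec3) == 16, "alignment raised to 16 bytes");
static_assert(sizeof(PaddedVec3) == 16, "size is rounded up to the alignment");
```

Note that raising the alignment also rounds the size up to 16 bytes, so an array of PaddedVec3 uses a third more memory than an array of Vec3.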

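For most built-in compound types, CUDA requires an alignment equal to the size of the compound type. For example, objects of types int2 and float2 must be aligned on an 8-byte boundary, while objects of types float4 and double2 must be aligned on a 16-byte boundary. The three-component built-in types, such as float3, are an exception: they only need to be aligned like their component type (four bytes for float3).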

To the second item: The GPU is able to perform aligned 4-byte, 8-byte, and 16-byte accesses, and in general, the wider each access, the higher the overall memory throughput. A vastly simplified view of the GPU hardware is that there are fixed-size queues inside the hardware that track each memory access. The wider each memory access, the larger the total number of bytes that can be queued up for transfer, which in turn improves latency tolerance and overall memory throughput.
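The arithmetic behind this simplified queue picture can be sketched as follows. This is only a toy model: the function names are made up, and the number of queue slots is a free parameter, not a figure for any real GPU.

```cpp
#include <cstddef>

// Number of accesses of `width` bytes needed to move `total_bytes`
// (rounded up).
constexpr std::size_t accesses_needed(std::size_t total_bytes, std::size_t width) {
    return (total_bytes + width - 1) / width;
}

// Bytes that can be in flight at once if the hardware tracks a fixed
// number of outstanding accesses (`queue_slots` is a hypothetical value).
constexpr std::size_t bytes_in_flight(std::size_t queue_slots, std::size_t width) {
    return queue_slots * width;
}
```

In this model, moving 64 bytes takes 16 accesses at a width of 4 bytes but only 4 accesses at a width of 16 bytes, and a queue with 8 slots keeps 32 bytes in flight with 4-byte accesses versus 128 bytes with 16-byte accesses.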

For this reason I would suggest switching, if possible, from a custom float3 type to the built-in float4 type. The former will cause data to be loaded in chunks of four bytes, while the latter allows data to be loaded in chunks of 16 bytes.
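As an illustration of what such a switch involves on the data side, the sketch below converts an array of three-float elements into 16-byte elements with an unused fourth component. It is host-side C++ with stand-in types (Float3Like and Float4Like are invented names that mimic the layout of a custom float3 and of CUDA's float4), and the function pad_to_four is hypothetical as well.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a custom three-float type: 12 bytes, 4-byte alignment.
struct Float3Like { float x, y, z; };

// Stand-in with the layout of CUDA's float4: 16 bytes, 16-byte alignment.
struct alignas(16) Float4Like { float x, y, z, w; };

// Copies each element into the wider layout; the fourth component is
// padding and is set to zero.
std::vector<Float4Like> pad_to_four(const std::vector<Float3Like>& in) {
    std::vector<Float4Like> out;
    out.reserve(in.size());
    for (const Float3Like& v : in) {
        out.push_back(Float4Like{v.x, v.y, v.z, 0.0f});
    }
    return out;
}
```

The trade-off is a third more memory per element in exchange for elements that each occupy exactly one aligned 16-byte slot.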