Profiling CUDA code: Unexpected instruction counts on coalesced memory reads

Question

The basic reason for this is down to code generation by the compiler. PTX assembler has different virtual register state spaces for floating point and integer, and it (I think) isn't possible to perform a 64 bit load into two registers in different state spaces. So the compiler emits two 32 bit loads in the mixed integer/float struct, but can emit a 64 bit vector load into two registers in the float/float struct case.

This can be illustrated by considering the following model of your code:

struct __align__(8) ElementsB 
{
    float weight;
    float value;
};

struct __align__(8) ElementsA 
{
    float weight;
    int value;
};

template<typename T>
__global__ void kernel(const T* __restrict__ in, T* __restrict__ out, bool flag)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    T ival = in[idx];
    if (flag) {
        out[idx] = ival;
    }
}


template __global__ void kernel<ElementsA>(const ElementsA *, ElementsA *, bool);
template __global__ void kernel<ElementsB>(const ElementsB *, ElementsB *, bool);

Here we have the two structures you mentioned, and a simple templated kernel instantiated for both types. If we look at the PTX emitted by the compiler for sm_20 (CUDA 5.0 release compiler), the differences are obvious. For the ElementsA instance:

    ld.param.u32    %r4, [_Z6kernelI9ElementsAEvPKT_PS1_b_param_0];
    ld.param.u32    %r5, [_Z6kernelI9ElementsAEvPKT_PS1_b_param_1];
    ld.param.u8     %rc1, [_Z6kernelI9ElementsAEvPKT_PS1_b_param_2];
    cvta.to.global.u32      %r1, %r5;
    cvta.to.global.u32      %r6, %r4;
    .loc 2 16 1
    mov.u32         %r7, %ntid.x;
    mov.u32         %r8, %ctaid.x;
    mov.u32         %r9, %tid.x;
    mad.lo.s32      %r2, %r7, %r8, %r9;
    .loc 2 18 1
    shl.b32         %r10, %r2, 3;
    add.s32         %r11, %r6, %r10;
    ld.global.u32   %r3, [%r11+4];  // 32 bit integer load
    ld.global.f32   %f1, [%r11];  // 32 bit floating point load

(comments added for emphasis)

and for the Element B instance:

    ld.param.u32    %r3, [_Z6kernelI9ElementsBEvPKT_PS1_b_param_0];
    ld.param.u32    %r4, [_Z6kernelI9ElementsBEvPKT_PS1_b_param_1];
    ld.param.u8     %rc1, [_Z6kernelI9ElementsBEvPKT_PS1_b_param_2];
    cvta.to.global.u32      %r1, %r4;
    cvta.to.global.u32      %r5, %r3;
    .loc 2 16 1
    mov.u32         %r6, %ntid.x;
    mov.u32         %r7, %ctaid.x;
    mov.u32         %r8, %tid.x;
    mad.lo.s32      %r2, %r6, %r7, %r8;
    .loc 2 18 1
    shl.b32         %r9, %r2, 3;
    add.s32         %r10, %r5, %r9;
    ld.global.v2.f32        {%f9, %f10}, [%r10];  // 64 bit float2 load

The reason there is no performance penalty between the two is that the underlying hardware uses 128 byte fetches for coalesced warp level loads, and in both cases the transactions result in the same pair of 128 byte fetches.