Alignment, total size and SSE

Question 1

The size of an object and its alignment are not the same thing. A struct whose size is 16 bytes, or some multiple of 16, is not necessarily 16-byte aligned.
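As a rough illustration of the difference, the sketch below compares two structs with the same size but different alignment requirements. The names Float4 and AlignedFloat4 are made up for this example, and it assumes a 4-byte float and a C++11 compiler (for alignas and alignof).

```cpp
#include <cstddef>

// Four floats: 16 bytes in size, but the type only needs float alignment.
struct Float4 {
    float data[4];
};

// Same members, with 16-byte alignment requested explicitly.
struct alignas(16) AlignedFloat4 {
    float data[4];
};

// Both types have the same size (assuming a 4-byte float)...
static_assert(sizeof(Float4) == 16, "expected 16 bytes");
static_assert(sizeof(AlignedFloat4) == 16, "expected 16 bytes");

// ...but only the second one is required to sit on a 16-byte boundary.
static_assert(alignof(Float4) == alignof(float), "alignment follows the float member");
static_assert(alignof(AlignedFloat4) == 16, "alignment was requested explicitly");
```

If all four checks compile, the size of Float4 tells you nothing about where the compiler is allowed to place it.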

In your case, since your code is compiled in 64-bit mode, you just need to pad the struct to 32 bytes. In 64-bit mode the stack is 16-byte aligned on Windows and Linux/Unix.

In 32-bit mode the stack does not have to be 16-byte aligned. You can test this. If you run the code below in MSVC in 32-bit mode, you will likely see that the address of each element of the array is not 16-byte aligned (you might have to run it a few times). So even though the size of the struct is a multiple of 16 bytes, it is not necessarily 16-byte aligned.

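#include <stdio.h>
#include <stdint.h>

int main() {
    union a {
        float data[4];
        struct {
            double x;
            double y;
            float z;
            float pad[3];
        }; // closes the anonymous struct
    };
    a b[10];
    for(int i=0; i<10; i++) {
        // address modulo 16; 0 means the element is 16-byte aligned
        printf("%u\n", (unsigned)((uintptr_t)&b[i] % 16));
    }
}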

If you want your code to work in 32-bit mode as well, you should align the memory explicitly. If you run the code below in 32-bit mode on Windows or Linux, you will see that every element is 16-byte aligned.

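#include <stdio.h>
#include <stdint.h>
#ifdef _MSC_VER // If Microsoft compiler
#define Alignd(X) __declspec(align(16)) X
#else // Gnu compiler, etc.
#define Alignd(X) X __attribute__((aligned(16)))
#endif

int main() {
    union a {
        float data[4];
        struct {
            double x;
            double y;
            float z;
            float pad[3];
        }; // closes the anonymous struct
    };
    a Alignd(b[10]); // the array itself is placed on a 16-byte boundary
    for(int i=0; i<10; i++) {
        // address modulo 16; 0 means the element is 16-byte aligned
        printf("%u\n", (unsigned)((uintptr_t)&b[i] % 16));
    }
}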
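If a C++11 compiler is available, the standard alignas specifier can stand in for the compiler-specific macro. The following is only a sketch of that alternative, not the code from the question: Point, is_aligned, stack_array_is_aligned and run_alignment_check are names invented here, and the layout assumes an 8-byte double and a 4-byte float.

```cpp
#include <cassert>
#include <cstdint>

// Same members as the struct above, with the alignment attached to the type.
struct alignas(16) Point {
    double x;
    double y;
    float z;
    float pad[3];
};

static_assert(sizeof(Point) == 32, "expected a 32-byte point");
static_assert(alignof(Point) == 16, "expected 16-byte alignment");

// True when the address p is a multiple of alignment.
inline bool is_aligned(const void* p, std::uintptr_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Checks the address of every element of a stack array.
inline bool stack_array_is_aligned() {
    Point pts[10];
    for (const Point& p : pts) {
        if (!is_aligned(&p, 16)) return false;
    }
    return true;
}

// Call this from main() to run the check.
inline void run_alignment_check() {
    assert(stack_array_is_aligned());
}
```

Because the alignment is part of the type, every Point variable and array gets it without annotating each declaration.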

Question 2

To get a struct that holds 2 doubles and a float and is SSE aligned (16 bytes), use:

#pragma pack(1)
struct T
{
 double x,y;   // 16 bytes
 float z;      // 4 bytes
 char gap[12]; // 12 bytes
};

sizeof(T) will be 32, so if the first point is 16-byte aligned, every point in the vector will be aligned.
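The size arithmetic can be checked at compile time. The sketch below is one way to do it, assuming an 8-byte double, a 4-byte float and a compiler that supports #pragma pack(push/pop) (GCC, Clang and MSVC do); PackedPoint and element_offset are names chosen for this example. It also shows that packing to 1 lowers the required alignment of the type itself to 1, which is why the first element still has to be aligned separately.

```cpp
#include <cstddef>

#pragma pack(push, 1) // pack only this struct, then restore the previous setting
struct PackedPoint {
    double x, y;  // 16 bytes
    float z;      // 4 bytes
    char gap[12]; // 12 bytes
};
#pragma pack(pop)

// 16 + 4 + 12 = 32 bytes, a multiple of 16.
static_assert(sizeof(PackedPoint) == 32, "expected a 32-byte point");
static_assert(offsetof(PackedPoint, z) == 16, "z follows the two doubles");

// Packing to 1 means the type itself no longer asks for any alignment.
static_assert(alignof(PackedPoint) == 1, "pack(1) lowers the required alignment");

// Byte offset of element i from the start of an array of points.
constexpr std::size_t element_offset(std::size_t i) {
    return i * sizeof(PackedPoint);
}

// Every element sits a multiple of 16 bytes after the first one.
static_assert(element_offset(1) % 16 == 0, "element 1 keeps the alignment");
static_assert(element_offset(7) % 16 == 0, "element 7 keeps the alignment");
```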

To make the first point aligned, use __attribute__((aligned(16))) for stack variables, or aligned_alloc for heap memory.
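For the heap case, a minimal sketch could look like the following. It assumes the C11 function aligned_alloc is available through <stdlib.h> (it is with glibc; MSVC does not provide it and offers _aligned_malloc instead), and alloc_points is a hypothetical helper name, not part of any library.

```cpp
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct Point {
    double x, y;  // 16 bytes
    float z;      // 4 bytes
    char gap[12]; // 12 bytes of padding
};
static_assert(sizeof(Point) == 32, "expected a 32-byte point");

// Hypothetical helper: allocates n points starting at a 16-byte-aligned address.
// Returns NULL on failure; release the memory with free().
inline Point* alloc_points(size_t n) {
    // The requested size, n * 32, is always a multiple of the alignment.
    Point* p = static_cast<Point*>(aligned_alloc(16, n * sizeof(Point)));
    // On success the returned address is a multiple of 16.
    assert(p == NULL || reinterpret_cast<uintptr_t>(p) % 16 == 0);
    return p;
}
```

A caller would write something like Point* pts = alloc_points(1000); and later free(pts);. Since sizeof(Point) is 32, every element after the first is 16-byte aligned too.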

But most of the PCL algorithms are written and hard-coded for floats, not doubles, so they won't work...

Refer : pcl-users link