Question

So I stumbled upon something which I'd like to understand, as it's causing me headaches. I have the following code:

#include <stdio.h>
#include <smmintrin.h>

typedef union {
    struct { float x, y, z, w; } v;
    __m128 m;
} vec;

vec __attribute__((noinline)) square(vec a)
{
    vec x = { .m = _mm_mul_ps(a.m, a.m) };
    return x;
}

int main(int argc, char *argv[])
{
    float f = 4.9;
    vec a = (vec){f, f, f, f};
    vec res = square(a); // ?
    printf("%f %f %f %f\n", res.v.x, res.v.y, res.v.z, res.v.w);
    return 0;
}

Now, in my mind, the call to square in main should put the value of a in xmm0 so that the square function can do mulps xmm0, xmm0 and be done with it.

This is not what happens when I compile with clang or gcc. Instead, the first 8 bytes of a are put in xmm0 and the next 8 bytes in xmm1, making the square function a lot more complicated as it needs to patch things back up.

Any idea why?

NOTE: This is with -O3 optimization.

After further research, it seems to have to do with the union type. If the function takes a plain __m128, the generated code expects the value in a single register (xmm0). But given that both types fit in xmm0, I don't see why the value is split across two half-used registers when the vec type is used.

Solution

The compiler is just trying to follow the calling convention as specified by the System V Application Binary Interface AMD64 Architecture Processor Supplement, section 3.2.3 Parameter Passing.

The relevant points are:

We first define a number of classes to classify arguments. The
classes are corresponding to AMD64 register classes and defined as:

SSE The class consists of types that fit into a vector register.

SSEUP The class consists of types that fit into a vector register and can
be passed and returned in the upper bytes of it.

The size of each argument gets rounded up to eightbytes.
The basic types are assigned their natural classes:
Arguments of types float, double, _Decimal32, _Decimal64 and __m64 are
in class SSE.

The classification of aggregate (structures and arrays) and union types
works as follows:

If the size of the aggregate exceeds a single eightbyte, each is
classified separately. 

Applying these rules, the x, y pair and the z, w pair of the embedded struct each occupy their own eightbyte and are each classified as SSE, which means they must be passed in two separate registers. The m member has no effect on this classification; you can even delete it and the result is the same.
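
To make the reasoning concrete, here is a minimal sketch of the per-eightbyte classification as a toy model in C. It is not code from any compiler: the names arg_class, merge_class and sse_registers_needed are hypothetical, and only the three classes relevant here are modelled. It assumes the ABI rule that a lone __m128 is classified as SSE for its low eightbyte and SSEUP for its high eightbyte, and that overlapping union members are merged per eightbyte, with an SSE/SSEUP mismatch falling back to SSE.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified subset of the ABI argument classes (hypothetical names). */
typedef enum { CLS_NONE, CLS_SSE, CLS_SSEUP } arg_class;

/* Merge the classes that two overlapping members assign to one eightbyte.
   Only the cases needed for this sketch: equal classes are kept, NO_CLASS
   yields to the other class, and any remaining mismatch becomes SSE. */
static arg_class merge_class(arg_class a, arg_class b)
{
    if (a == b) return a;
    if (a == CLS_NONE) return b;
    if (b == CLS_NONE) return a;
    return CLS_SSE;
}

/* Given the per-eightbyte classes of two union members (at most 16 bytes),
   return how many vector registers the merged classification needs. */
int sse_registers_needed(const arg_class *m1, const arg_class *m2, size_t n)
{
    arg_class merged[2];
    int regs = 0;

    assert(n >= 1 && n <= 2);
    for (size_t i = 0; i < n; i++) {
        merged[i] = merge_class(m1[i], m2[i]);
        /* Post-merger cleanup: SSEUP not preceded by SSE/SSEUP becomes SSE. */
        if (merged[i] == CLS_SSEUP &&
            (i == 0 || (merged[i - 1] != CLS_SSE && merged[i - 1] != CLS_SSEUP)))
            merged[i] = CLS_SSE;
        /* Each SSE eightbyte starts a new register; SSEUP reuses the upper
           half of the previous one. */
        if (merged[i] == CLS_SSE)
            regs++;
    }
    return regs;
}

/* The union from the question: struct { float x, y, z, w; } merged with __m128. */
int registers_for_vec_union(void)
{
    const arg_class as_struct[2] = { CLS_SSE, CLS_SSE };   /* {x, y}, {z, w} */
    const arg_class as_m128[2]   = { CLS_SSE, CLS_SSEUP };
    return sse_registers_needed(as_struct, as_m128, 2);
}

/* A plain __m128 argument: nothing to merge with. */
int registers_for_plain_m128(void)
{
    const arg_class nothing[2] = { CLS_NONE, CLS_NONE };
    const arg_class as_m128[2] = { CLS_SSE, CLS_SSEUP };
    return sse_registers_needed(nothing, as_m128, 2);
}
```

In this model registers_for_vec_union() yields two registers and registers_for_plain_m128() yields one, which matches the xmm0/xmm1 split seen for vec and the single xmm0 seen for a plain __m128.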

OTHER TIPS

EDIT: on a second read through, I'm less certain why this is happening, but I'm more certain that this is where it is happening. I don't think this answer is right, but I'll leave it up as it may be helpful.

Speaking only for clang:

This looks like an unfortunate side effect of a compiler heuristic.

From a brief look at clang (file CGRecordLayoutBuilder.cpp, function CGRecordLowering::lowerUnion), it looks like LLVM doesn't represent union types as such internally, and the types in a function's signature don't change depending on how they are used inside the function.

clang looks at your function and sees that it needs 16 bytes worth of arguments for the type signature, then uses a heuristic to pick which type it thinks is best. It favors a { double, double } interpretation over a <4 x float> (which would give it the most efficiency in your case) because doubles are more lenient with respect to alignment.

I'm no expert on clang internals, so I could be very wrong, but it doesn't look like there's a particularly nice way around this one. If you want the optimized version you may have to use pointer casting instead of unions to get it.
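
One possible workaround, sketched below, is a different technique from the pointer casting mentioned above: keep __m128 in the signature of the function that does the work, which the question reports is passed in a single register, and use the union only for field access. The names square_m128 and the inline wrapper are illustrative assumptions, not part of the original code, and the sketch assumes an x86-64 target with SSE.

```c
#include <assert.h>
#include <xmmintrin.h>

typedef union {
    struct { float x, y, z, w; } v;
    __m128 m;
} vec;

/* The union must be exactly one vector wide for this approach to make sense. */
static_assert(sizeof(vec) == sizeof(__m128), "vec must be 16 bytes");

/* Takes and returns a plain __m128, so the ABI classifies the argument as a
   single vector value rather than as two separate eightbytes. */
__m128 __attribute__((noinline)) square_m128(__m128 a)
{
    return _mm_mul_ps(a, a);
}

/* Thin inline wrapper: callers keep using vec for named field access, while
   the out-of-line call boundary only ever sees __m128. */
static inline vec square(vec a)
{
    vec r;
    r.m = square_m128(a.m);
    return r;
}
```

Whether the wrapper is actually inlined, and what the resulting code looks like, depends on the compiler and flags, so it is worth checking the generated assembly.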

The code I suspect is causing the problem:

void CGRecordLowering::lowerUnion() {
    ...
    // Conditionally update our storage type if we've got a new "better" one.
    if (!StorageType ||
        getAlignment(FieldType) >  getAlignment(StorageType) ||
        (getAlignment(FieldType) == getAlignment(StorageType) &&
        getSize(FieldType) > getSize(StorageType)))
      StorageType = FieldType;
    ...
}
Licensed under: CC-BY-SA with attribution