Question

The function below calculates the absolute value of a 32-bit floating-point value:

__forceinline static float Abs(float x)
{
    union {
        float x;
        int a;
    } u;
    //u.x = x;
    u.a &= 0x7FFFFFFF;
    return u.x;
}

The union u declared in the function holds a member x, which is different from the x passed as the function's parameter. Is there any way to create a union directly over the function argument x?

Is there any reason the function above, with the commented line uncommented, would execute longer than this one?

__forceinline float fastAbs(float a)
{
    int b= *((int *)&a) & 0x7FFFFFFF;
    return *((float *)(&b));
}

I'm trying to figure out the best way to take the absolute value of a floating-point value with as few reads and writes to memory as possible.

Solution

Looking at the disassembly of the code compiled in release mode, the difference is quite clear. I removed the inline and made the two functions virtual so that the compiler would not optimize too aggressively and the differences would stay visible.

This is the first function.

013D1002  in          al,dx  
            union {
                float x;
                int a;
            } u;
            u.x = x;
013D1003  fld         dword ptr [x]   // Loads a float on top of the FPU STACK.
013D1006  fstp        dword ptr [x]   // Pops a Float Number from the top of the FPU Stack into the destination address.
            u.a &= 0x7FFFFFFF;
013D1009  and         dword ptr [x],7FFFFFFFh  // Execute a 32 bit binary and operation with the specified address.
            return u.x;
013D1010  fld         dword ptr [x]  // Loads the result on top of the FPU stack.
        }

This is the second function.

013D1020  push        ebp                       // Standard function entry. I'm using a virtual function here to show the difference.
013D1021  mov         ebp,esp
            int b= *((int *)&a) & 0x7FFFFFFF;
013D1023  mov         eax,dword ptr [a]         // Load into eax our parameter.
013D1026  and         eax,7FFFFFFFh             // Execute 32 bit binary and between our register and our constant.
013D102B  mov         dword ptr [a],eax         // Move the register value into our destination variable
            return *((float *)(&b));
013D102E  fld         dword ptr [a]             // Loads the result on top of the FPU stack.

The first function performs more floating-point operations and makes heavier use of the FPU stack. Both functions do exactly what you asked, so no surprise there. I therefore expect the second function to be faster.

Once the virtual is removed and the functions are inlined, things are a little different. It is hard to show the disassembly here because the compiler does a good job, but I repeat: if the values are not constants, the compiler will use more floating-point operations in the first function. In general, a simple integer operation such as this and is cheaper than the extra x87 loads and stores.

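Are you sure that directly using the fabs function from math.h is slower than your method? (Plain abs in C takes an int; fabs, or std::abs from <cmath> in C++, is the floating-point version.) If correctly inlined, fabs will compile to just this: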

00D71016  fabs  

Micro-optimizations like this are hard to see in long code, but if your function is called in a long chain of floating-point operations, fabs will work better, since the values will already be on the FPU stack or in SSE registers. fabs would be faster and better optimized by the compiler.
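
As a minimal sketch of the library route suggested above: the wrapper name libraryAbs is hypothetical and only serves the illustration. Whether the call really collapses to a single instruction depends on the compiler and its flags, so check your own disassembly.

```cpp
#include <cmath>

// Hypothetical wrapper name, used only for this illustration.
inline float libraryAbs(float x)
{
    // std::fabs has a float overload in <cmath>; optimizing compilers
    // usually expand it inline rather than emitting a real call.
    return std::fabs(x);
}
```

Because the value never has to leave the floating-point side, this form leaves the compiler free to keep it in whatever register the surrounding code already uses.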

You cannot measure the effect of optimizations like this by running a loop over an isolated piece of code; you must look at how the compiler mixes everything together in the real code.

OTHER TIPS

For the first question, I'm not sure why you can't just do what you want with an assignment. The compiler will do whatever optimizations can be done.

Your second code sample violates strict aliasing, so the two are not equivalent.
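
One common way to do the same bit manipulation without the aliasing problem is to copy the bytes with std::memcpy. The sketch below assumes a 32-bit IEEE 754 float, and the function name absViaMemcpy is hypothetical. Compilers generally optimize the small fixed-size memcpy calls away, but that is an expectation to verify, not a guarantee.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical name; not part of the original question.
inline float absViaMemcpy(float x)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // copy the float's bytes into an integer
    bits &= 0x7FFFFFFFu;                  // clear the sign bit (IEEE 754 layout assumed)
    std::memcpy(&x, &bits, sizeof x);     // copy the masked bytes back
    return x;
}
```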

As for why it's slower:

It's because CPUs today tend to have separate integer and floating-point units. By type-punning like that, you force the value to be moved from one unit to the other. This has overhead. (This is often done through memory, so you have extra loads and stores.)

In the second snippet, a, which is originally in the floating-point unit (either the x87 FPU or an SSE register), needs to be moved into the general-purpose registers to apply the mask 0x7FFFFFFF. Then it needs to be moved back.

In the first snippet, the compiler is probably smart enough to load a directly into the integer unit, so you bypass the FPU in the first stage.

(I'm not 100% sure until you show us the assembly. It will also depend heavily on whether the parameter starts off in a register or on the stack, and on whether the output is used immediately by another floating-point operation.)

Licensed under: CC-BY-SA with attribution