How can I account for round-off errors in floating-point arithmetic for inverse trig (and sqrt) functions (in C)?

StackOverflow https://stackoverflow.com/questions/4171239

Question

I have a fairly complicated function that takes several double values that represent two vectors in 3-space of the form (magnitude, latitude, longitude) where latitude and longitude are in radians, and an angle. The purpose of the function is to rotate the first vector around the second by the angle specified and return the resultant vector. I have already verified that the code is logically correct and works.

The expected purpose of the function is graphics, so double precision is not necessary. However, on the target platform, general arithmetic works faster on doubles than on floats (probably because the underlying instructions actually require a double; if a float is passed, the value must be converted to a double, which requires copying it to an area with more memory -- i.e. overhead). As a result, all of the variables involved in the function are double precision. At the same time, the single-precision trig and sqrt functions (sinf, cosf, atan2f, asinf, acosf and sqrtf specifically) are faster than their double-precision counterparts.

Here is the issue: I am trying to optimize my function so that it can be called more times per second. I have therefore replaced the calls to sin, cos, sqrt, et cetera with calls to the single-precision versions of those functions, which yields a 3-4 times speed increase overall. This works for almost all inputs; however, if the input vectors are close to parallel with the standard unit vectors (i, j, or k), round-off errors build up enough that later calls to sqrtf or the inverse trig functions (asinf, acosf, atan2f) receive arguments just barely outside the domain of those functions.
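
To make the failure mode concrete (this is an illustration, not the actual function): a dot product that lands even one ulp above 1.0f is enough to knock both acosf and a sqrtf-based identity out of their domains.

#include <math.h>
#include <stdio.h>

int main( void )
{
    // One ulp above 1.0f -- the kind of value a single-precision dot
    // product of two nearly parallel unit vectors can produce.
    float dot = nextafterf( 1.0f, 2.0f );

    printf( "dot = %.9f\n", dot );                                     // 1.000000119
    printf( "acosf(dot) = %f\n", acosf( dot ) );                       // nan: outside [-1, 1]
    printf( "sqrtf(1 - dot*dot) = %f\n", sqrtf( 1.0f - dot * dot ) ); // nan as well
    return 0;
}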

So, I am left with this dilemma: either I can call only the double-precision functions and avoid the problem (and end up with a limit of about 1,300,000 vector operations per second), or I can try to come up with something else. Ultimately, I would like a way to sanitize the input to the inverse trig functions to take care of the edge cases (it is trivial to do so for sqrt: just take the absolute value with fabsf). Branching is not an option, as even a single conditional statement adds so much overhead that any performance gains are lost.
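
A minimal sketch of the kind of branch-free sanitizing in question, assuming the compiler maps fminf/fmaxf to hardware min/max (with SSE code generation and -ffast-math or equivalent, these typically become single minss/maxss instructions, i.e. no conditional jump):

#include <math.h>

// Clamp into the inverse-trig domain [-1, 1] without branching.
static inline float clamp_unit( float x )
{
    return fmaxf( -1.0f, fminf( x, 1.0f ) );
}

// Hypothetical wrappers around the problem calls:
static inline float safe_asinf( float x ) { return asinf( clamp_unit( x ) ); }
static inline float safe_acosf( float x ) { return acosf( clamp_unit( x ) ); }
static inline float safe_sqrtf( float x ) { return sqrtf( fabsf( x ) ); }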

So, any ideas?

Edit: someone expressed confusion over my using doubles versus floats. The function is much faster if I actually store all of my values in double-sized containers (i.e. double-typed variables) than if I store them in float-sized ones. However, the single-precision trig operations (sinf and friends) are faster than the double-precision ones, for obvious reasons.

Solution

Basically, you need to find a numerically stable algorithm that solves your problem. There are no generic solutions to this kind of thing; it has to be worked out for your specific case, using concepts such as the condition number of the individual steps. And it may in fact be impossible if the underlying problem is itself ill-conditioned.
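
As one concrete instance of swapping an ill-conditioned formula for a stable one (a standard technique, not necessarily the right fix for this exact function): the angle between two unit vectors computed as acosf(dot) loses accuracy badly when the vectors are nearly parallel, and can be handed a dot product just outside [-1, 1]. Computing the same angle with atan2f of the cross product's magnitude and the dot product is well-conditioned over the whole range and has no domain restriction at all.

#include <math.h>

// Angle between two unit 3-vectors a and b.
// atan2f(|a x b|, a . b) is numerically stable where acosf(a . b) is not,
// and its sqrtf argument is a sum of squares, so it can never go negative.
float angle_between( const float a[3], const float b[3] )
{
    float dot = a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
    float cx  = a[1]*b[2] - a[2]*b[1];
    float cy  = a[2]*b[0] - a[0]*b[2];
    float cz  = a[0]*b[1] - a[1]*b[0];
    return atan2f( sqrtf( cx*cx + cy*cy + cz*cz ), dot );
}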

OTHER TIPS

Single-precision floating point inherently introduces error. So you need to build your math so that all comparisons allow a certain degree of "slop" (an epsilon factor), and you need to sanitize the inputs to functions with limited domains.

The former is easy enough when branching is allowed, e.g.

bool IsAlmostEqual( float a, float b ) { return fabsf(a - b) < 0.001f; } // absolute error, or
bool IsAlmostEqual( float a, float b ) { return fabsf(a - b) < fabsf(a) * 0.0001f; } // for relative error

but that's messy. Clamping domain inputs is a little trickier, but better. The key is to use conditional move operators, which in general do something like

float ExampleOfConditionalMoveIntrinsic( float comparand, float a, float b )
{ return comparand >= 0.0f ? a : b; }

in a single op, without incurring a branch.

These vary depending on architecture. On the x87 floating point unit you can do it with the FCMOV conditional-move op, but that is clumsy because it depends on condition flags being set previously, so it's slow. Also, there isn't a consistent compiler intrinsic for cmov. This is one of the reasons why we avoid x87 floating point in favor of SSE2 scalar math where possible.

Conditional move is much better supported in SSE by pairing a comparison operator with a bitwise AND. This is preferable even for scalar math:

// assuming you've already used _mm_load_ss to load your floats into registers
__m128 fsel( __m128 comparand, __m128 a, __m128 b )
{
    __m128 zero = _mm_setzero_ps();
    // set low word of mask to all 1s if comparand > 0
    __m128 mask = _mm_cmpgt_ss( comparand, zero );
    a = _mm_and_ss( a, mask );    // a = a & mask
    b = _mm_andnot_ss( mask, b ); // b = ~mask & b
    return _mm_or_ss( a, b );     // return a | b
}
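
As a usage sketch built on the fsel() above (acosf_clamped is a hypothetical helper, not a library function): clamping into acosf's domain takes two selects, still with no branch anywhere.

#include <xmmintrin.h>
#include <math.h>

// Clamp x into [-1, 1] with two fsel() calls, then call acosf on the
// sanitized value.
float acosf_clamped( float x )
{
    __m128 v  = _mm_set_ss( x );
    __m128 hi = _mm_set_ss( 1.0f );
    __m128 lo = _mm_set_ss( -1.0f );
    v = fsel( _mm_sub_ss( hi, v ), v, hi ); // v = (1 - v > 0) ? v : 1
    v = fsel( _mm_sub_ss( v, lo ), v, lo ); // v = (v + 1 > 0) ? v : -1
    return acosf( _mm_cvtss_f32( v ) );
}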

Compilers are getting better, though still not great, at emitting this sort of pattern for ternaries when SSE2 scalar math is enabled. You can enable that with the compiler flag /arch:SSE2 on MSVC, or -mfpmath=sse on GCC (which also needs -msse2 on 32-bit x86; on x86-64 it is the default).

On the PowerPC and many other RISC architectures, fsel() is a hardware opcode and thus usually a compiler intrinsic as well.

Have you looked at the Graphics Programming Black Book or perhaps handing the calculations off to your GPU?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow