Question

I'm looking for a commonly understandable notation to define a fixed point number representation. The notation should be able to define both a power-of-two factor (using fractional bits) and a generic factor (sometimes I'm forced to use this, though less efficient). And also an optional offset should be defined.
I already know some possible notations, but all of them seem to be constrained to specific applications.

  • For example the Simulink notation would perfectly fit my needs, but it's known only in the Simulink world. Furthermore the overloaded usage of the fixdt() function is not so readable.

  • TI defines a really compact Q Formats, but the sign is implicit, and it doesn't manage a generic factor (i.e. not a power-of-two).

  • ASAM uses a generic 6-coefficient rational function with 2nd-degree numerator and denominator polynomials (COMPU_METHOD). Very generic, but not so friendly.

See also the Wikipedia discussion.

The question is only about the notation (not efficiency of the representation nor fixed-point manipulation). So it's a matter of code readability, maintenability and testability.

Was it helpful?

Solution

Ah, yes. Having good naming annotations is absolutely critical to not introducing bugs with fixed point arithmetic. I use an explicit version of the Q notation which handles any division between M and N by appending _Q<M>_<N> to the name of the variable. This also makes it possible to include the signedness as well. There are no run-time performance penalties for this. Example:

uint8_t length_Q2_6;                // unsigned, 2 bit integer, 6 bit fraction
int32_t sensor_calibration_Q10_21;  // signed (1 bit), 10 bit integer, 21 bit fraction.

/*
 * Calculations with the bc program (with '-l' argument):
 *
 * sqrt(3)
 * 1.73205080756887729352
 *
 * obase=16
 * sqrt(3)
 * 1.BB67AE8584CAA73B0
 */
const uint32_t SQRT_3_Q7_25 = 1 << 25 | 0xBB67AE85U >> 7; /* Unsigned shift super important here! */

In case someone have not fully understood why such annotation is extremely important, Can you spot the if there is an bug in the following two examples?

Example 1:

speed_fraction = fix32_udiv(25, speed_percent << 25, 100 << 25);
squared_speed  = fix32_umul(25, speed_fraction, speed_fraction);
tmp1 = fix32_umul(25, squared_speed, SQRT_3);
tmp2 = fix32_umul(12, tmp1 >> (25-12), motor_volt << 12);

Example 2:

speed_fraction_Q7_25 = fix32_udiv(25, speed_percent << 25, 100 << 25);
squared_speed_Q7_25  = fix32_umul(25, speed_fraction_Q7_25, speed_fraction_Q7_25);
tmp1_Q7_25  = fix32_umul(25, squared_speed_Q7_25, SQRT_3_Q1_31);
tmp2_Q20_12 = fix32_umul(12, tmp1_Q7_25 >> (25-12), motor_volt << 12);

Imagine if one file contained #define SQRT_3 (1 << 25 | 0xBB67AE85U >> 7) and another file contained #define SQRT_3 (1 << 31 | 0xBB67AE85U >> 1) and code was moved between those files. For example 1 this has a high chance of going unnoticed and introduce the bug that is present in example 2 which here is done deliberately and has a zero chance of being done accidentally.

OTHER TIPS

Actually Q format is the most used representation in commercial applications: you use is when you need to deal with fractional numbers FAST and your processor does not have a FPU (floating point unit) is it cannot use float and double data types natively - it has to emulate instructions for them which are very expensive.

usually you use Q format to represent only the fractional part, though this not a must, you get more precision for your representation. Here's what you need to consider:

  • number of bits you use (Q15 uses 16 bitdata types, usually short int)
  • the first bit is the sign bit (out of 16 bits you are left with 15 for data value)
  • the rest of the bits are used to store the fractional part of your number.
  • since you are representing fractional numbers your value is somewhere in [0,1)
  • you can choose to use some bits for the integer part as well, but you would loose precision - e.g if you wanted to represent 3.3 in Q format, you would need 1 bit for sign, 2 bits for the integer part, and are left with 13 bits for the fractional part (assuming you are using 16 bits representation)-> this format is called 2Q13

Example: Say you want to represent 0.3 in Q15 format; you apply the Rule of Three:

      1 = 2^15 = 32768 = 0x8000
    0.3 = X
    -------------
    X = 0.3*32768 = 9830 = 0x666

You lost precision by doing this but at least the computation is fast now.

In C, you can't use a user defined type like a builtin one. If you want to do that, you need to use C++. In that language you can define a class for your fixed point type, overload all the arithmetic operators (+, -, *, /, %, +=, -=, *=, /=, %=, --, ++, cast to other types), so that usage of the instances of this class really behave like the builtin types.

In C, you need to do what you want explicitly. There are two basic approaches.


Approach 1: Do the fixed point adjustments in the user code.
This is overhead-free, but you need to remember to do the correct adjustments. I believe, it is easiest to just add the number of past point bits to the end of the variable name, because the type system won't do you much good, even if you typedef'd all the point positions you use. Here is an example:

int64_t a_7 = (int64_t)(7.3*(1<<7));    //a variable with 7 past point bits
int64_t b_5 = (int64_t)(3.78*(1<<5));   //a variable with 5 past point bits
int64_t sum_7 = a_7 + (b_5 << 2);    //to add those two variables, we need to adjust the point position in b
int64_t product_12 = a_7 * b_5;    //the product produces a number with 12 past point bits

Of course, this is a lot of hassle, but at least you can easily check at every point whether the point adjustment is correct.


Approach 2: Define a struct for your fixed point numbers and encapsulate the arithmetic on it in a bunch of functions. Like this:

typedef struct FixedPoint {
    int64_t data;
    uint8_t pointPosition;
} FixedPoint;

FixedPoint fixed_add(FixedPoint a, FixedPoint b) {
    if(a.pointPosition >= b.PointPosition) {
        return (FixedPoint){
            .data = a.data + (b.data << a.pointPosition - b.pointPosition),
            .pointPosition = a.pointPosition
        };
    } else {
        return (FixedPoint){
            .data = (a.data << b.pointPosition - a.pointPosition) + b.data,
            .pointPosition = b.pointPosition
        };
    }
}

This approach is a bit cleaner in the usage, however, it introduces significant overhead. That overhead consists of:

  1. The function calls.

  2. The copying of the structs for parameter and result passing, or the pointer dereferences if you use pointers.

  3. The need to calculate the point adjustments at runtime.

This is pretty much similar to the overhead of a C++ class without templates. Using templates would move some decisions back to compile time, at the cost of loosing flexibility.

This object based approach is probably the most flexible one, and it allows you to add support for non-binary point positions in a transparent way.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top