Fixed-point scaling and accuracy in multiplication

https://stackoverflow.com/questions/19056657

29-06-2022

Question

I need to perform a multiplication operation on a fixed-point variable x (unsigned 16-bit integer [U16] type with binary point 6 [BP6]) with a coefficient A, which I know will always be between 0 and 1. Code is being written in C for a 32-bit embedded platform.

I know that if I were to also make this coefficient a U16 BP6, then I would end up with a U32 BP12 from the multiplication. I want to rescale this result back down to U16 BP6, so I just lop off the first 10 bits and the last 6.
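A minimal sketch of that rescale in C, assuming the fixed-width types from `<stdint.h>`. The function name `mul_u16bp6` is hypothetical and not taken from any library. The right shift by 6 drops the low fractional bits and the cast to `uint16_t` drops the upper bits, which loses nothing as long as the coefficient does not exceed 1.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two U16 BP6 values and return U16 BP6.
 * The intermediate product is U32 BP12. */
uint16_t mul_u16bp6(uint16_t x, uint16_t a)
{
    uint32_t product = (uint32_t)x * (uint32_t)a;  /* U32 BP12 */
    return (uint16_t)(product >> 6);               /* drop 6 LSBs, keep low 16 bits */
}
```

For instance, 172.0 is stored as 172 * 64 = 11008, and the stored result is divided by 64 to read it back as a decimal value.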

However, since the coefficient is limited in precision by the number of fractional bits, and I do not necessarily need the full 10 bits of integer, I was thinking that I could just make the coefficient variable A a U16 BP15 to yield a more precise result.

I have worked out the following example (bear with me):

Let's say that x = 172.0 (decimal) and I want to use a coefficient A = 0.82 (decimal). The ideal decimal result would be 172.0 * 0.82 = 141.04.

In binary, x = 0010101100.000000.

If I am using BP6 for A, the binary representation will be either

    A_1 = 0000000000.110100 = 0.8125 or
    A_2 = 0000000000.110101 = 0.828125

(depending on whether value is based on floor or ceiling).
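One possible way to derive the floor and ceiling encodings of a coefficient is sketched below. The helper names `quantize_floor` and `quantize_ceil` are hypothetical, and the sketch assumes the coefficient is given as a `double` in the range 0 to 1 and that the binary point is at most 15, so the result fits in 16 bits.

```c
#include <stdint.h>

/* Hypothetical helper: largest fixed-point code not above `value`.
 * Assumes 0.0 <= value <= 1.0 and bp <= 15. */
uint16_t quantize_floor(double value, unsigned bp)
{
    double scaled = value * (double)(UINT32_C(1) << bp);
    return (uint16_t)scaled;            /* truncation equals floor for non-negative input */
}

/* Hypothetical helper: smallest fixed-point code not below `value`.
 * Same assumptions as quantize_floor. */
uint16_t quantize_ceil(double value, unsigned bp)
{
    double scaled = value * (double)(UINT32_C(1) << bp);
    uint32_t q = (uint32_t)scaled;
    if ((double)q < scaled) {
        q++;                            /* a fractional part was cut off, so step up */
    }
    return (uint16_t)q;
}
```

With 0.82 and a binary point of 6, these return 52 (0.8125) and 53 (0.828125) respectively.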

Performing the binary multiplication between x and either value of A yields (leaving out leading zeroes):

    A_1 * x = 10001011.110000000000 = 139.75 
    A_2 * x = 10001110.011100000000 = 142.4375

In both cases, trimming the last 6 bits would not affect the result.

Now, if I expanded A to have BP15, then

    A_3 = 0.110100011110110 = 0.82000732421875

and the resulting multiplication yields

    A_3 * x = 10001101.000010101001000000000 = 141.041259765625

When trimming the extra 15 fractional bits, the result is

    A_3 * x = 10001101.000010 = 141.03125
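The BP15 case can be sketched in C as follows. The function name `scale_bp6_by_bp15` is hypothetical, and the sketch assumes the coefficient is stored as a U16 BP15 value below 1.0, so the truncated result always fits in 16 bits.

```c
#include <stdint.h>

/* Hypothetical helper: scale a U16 BP6 value by a U16 BP15 coefficient.
 * The product is U32 BP21; shifting right by 15 truncates back to BP6.
 * Assumes the coefficient is below 1.0 (a_bp15 < 32768). */
uint16_t scale_bp6_by_bp15(uint16_t x, uint16_t a_bp15)
{
    uint32_t product = (uint32_t)x * (uint32_t)a_bp15;  /* U32 BP21 */
    return (uint16_t)(product >> 15);                   /* back to BP6 */
}
```

With x = 11008 (172.0) and a coefficient of 26870 (0.82000732421875), this returns 9026, which is 141.03125 when read as BP6.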

So it's pretty clear here that expanding the coefficient to have more fractional bits yields a more precise result (at least in my example). Is this something which will hold true in general? Is this good/bad to use in practice? Am I missing or misunderstanding something?

EDIT: I should have said "accuracy" in place of "precision" here. I am looking for a result which is closer to my expected value rather than a result which contains more fractional bits.

Solution

Having written similar code, I'd say what you are doing will hold true in general, with the following concerns.

It is very easy to get unexpected overflow when shifting your binary point around. Rigorous testing and analysis, and/or overflow detection in the code, is recommended. Notable failure: Ariane 5.
You want precision, thus I disagree with "lop off ... last 6". Instead, I recommend rounding your results as processing time allows: use the most significant of the bits being lopped off to decide whether to adjust the result.
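Both concerns can be illustrated with a short C sketch. The function name `scale_bp6_by_bp15_round` is hypothetical. It assumes a U16 BP6 input and a U16 BP15 coefficient, rounds to nearest instead of truncating, and saturates rather than wrapping if the result does not fit in 16 bits. Saturation is just one possible policy; reporting an error may suit some applications better.

```c
#include <stdint.h>

/* Hypothetical helper: scale a U16 BP6 value by a U16 BP15 coefficient,
 * rounding to nearest and saturating on overflow. */
uint16_t scale_bp6_by_bp15_round(uint16_t x, uint16_t a_bp15)
{
    uint32_t product = (uint32_t)x * (uint32_t)a_bp15;   /* U32 BP21 */

    /* Adding half an output LSB before the shift carries into the result
     * exactly when the most significant discarded bit is set.
     * The sum cannot wrap: 65535 * 65535 + 16384 is below 2^32. */
    uint32_t rounded = (product + (UINT32_C(1) << 14)) >> 15;

    /* Overflow check: only reachable when the coefficient is 1.0 or more. */
    if (rounded > UINT16_MAX) {
        return UINT16_MAX;
    }
    return (uint16_t)rounded;
}
```

For x = 11008 (172.0) and a coefficient of 26870, this returns 9027 (141.046875), whereas truncation gives 9026 (141.03125). The ideal result is 141.04, so rounding is slightly closer in this case.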

Licensed under: CC-BY-SA with attribution
