Question

It's an embedded platform, hence the restrictions.

original equation: 0.02035*c*c - 2.4038*c

Did this:

int32_t val = 112; // this value is arbitrary
int32_t result = (val*((val * 0x535A8) - 0x2675F70));
result = result>>24;

The precision is still poor. When we multiply val*0x535A8, is there a way to further improve the precision by rounding up, but without using any float, double, or division?


Solution

The problem is not precision. You're using plenty of bits.

I suspect the problem is that you're comparing two different methods of converting to int. The first is a cast of a double, the second is a truncation by right-shifting.

Converting floating point to integer simply drops the fractional part, leading to a round towards zero; right-shifting does a round down or floor. For positive numbers there's no difference, but for negative numbers the two methods will be 1 off from each other. See an example at http://ideone.com/rkckuy and some background reading at Wikipedia.

Your original code is easy to fix:

int32_t result = (val*((val * 0x535A8) - 0x2675F70));
if (result < 0)
    result += 0xffffff;
result = result>>24;

See the results at http://ideone.com/D0pNPF

You might also just decide that the right shift result is OK as is. The conversion error isn't greater than it is for the other method, just different.

Edit: If you want to do rounding instead of truncation the answer is even easier.

int32_t result = (val*((val * 0x535A8) - 0x2675F70));
result = (result + (1L << 23)) >> 24;

I'm going to join in with some of the others in suggesting that you use a constant expression to replace those magic constants with something that documents how they were derived.

static const int32_t a = (int32_t)(0.02035 * (1L << 24) + 0.5);
static const int32_t b = (int32_t)(2.4038 * (1L << 24) + 0.5);
int32_t result = (val*((val * a) - b));

OTHER TIPS

How about just scaling the equation by 100000, so the constants become integers: 2035 and 240380? The largest intermediate value is then 2035*120*120 = 29304000, which is far below the 2^31 limit. So maybe there is no need to do real bit-tweaking here.

As noted by Joe Hass, your problem is that you shift your precision bits into the dustbin.

Whether you shift the decimal point by a power of 2 or a power of 10 does not actually matter. Just pretend your decimal point is not behind the last bit but at the shifted position. If you keep computing with the result, a power-of-2 shift is likely easier to handle. If you just want to output the result, scale by a power of ten as proposed above, convert the digits and insert the decimal point five characters from the right.

Givens:

Let's assume 1 <= c <= 120,
original equation: 0.02035*c*c - 2.4038*c
then -70.98586 < f(c) < 4.585
--> -71 <= result <= 5 when rounding f(c) to the nearest int32_t.
Arguments A = 0.02035 and B = 2.4038;
A and B may change a bit with subsequent compiles, but not at run-time.


Allow the coder to input values like 0.02035 and 2.4038. The key component, shown here and by others, is to scale factors like 0.02035 by some power of 2, evaluate the equation (simplified into the form (A*c - B)*c), and then scale the result back.

Important features:

1. When determining A and B, ensure the compile-time floating point multiplication and final conversion occur via a round and not a truncation. With positive values, the + 0.5 achieves that. Without a rounded answer, UD_A*UD_Scaling could end up just under a whole number and truncate away 0.999999 when converting to int32_t.

2. Instead of doing expensive division at run-time, we do >> (right shift). By adding half the divisor (as suggested by @Joe Hass) before the shift, we get a nicely rounded answer. It is important not to code / here, because some_signed_int / 4 and some_signed_int >> 2 do not round the same way: with 2's complement, >> rounds toward INT_MIN whereas / truncates toward 0.

#include <inttypes.h>  /* PRId32 */
#include <stdint.h>
#include <stdio.h>

#define UD_A          (0.02035)
#define UD_B          (2.4038)
#define UD_Shift      (24)
#define UD_Scaling    ((int32_t) 1 << UD_Shift)
#define UD_ScA        ((int32_t) (UD_A*UD_Scaling + 0.5))
#define UD_ScB        ((int32_t) (UD_B*UD_Scaling + 0.5))

int main(void) {
  for (int32_t val = 1; val <= 120; val++) {
    int32_t result = ((UD_ScA*val - UD_ScB)*val + UD_Scaling/2) >> UD_Shift;
    printf("%" PRId32 " %" PRId32 "\n", val, result);
  }
  return 0;
}

Example differences:

val,   OP equation,  OP code, This code
  1,      -2.38345,       -3,       -2
 54,     -70.46460,      -71,      -70
120,       4.58400,        4,        5

This is a new answer. My old +1 answer deleted.

If your input uses at most 7 bits and you have 32 bits available, then your best bet is to shift everything left by as many bits as possible and work with that:

int32_t result;
result = (val * (int32_t)(0.02035 * 0x1000000)) - (int32_t)(2.4038 * 0x1000000);
result >>= 8; // make room for another 7 bit multiplication
result *= val;
result >>= 16;

Constant conversion will be done by an optimising compiler at compile time.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow