Dividing a denormalized number by 2

https://stackoverflow.com/questions/9090353

21-04-2021

Question

I am writing an algorithm that divides a floating point number by 2. For numbers that are already normalized (the exponent bits are > 0), I think the process is pretty straightforward: simply decreasing the exponent field by one and then sticking that value back in is a correct approach.

I am having trouble working out how to handle a floating point number that is already denormalized (exponent bits are all 0's). I understand what a denormalized number is, and believe I generally understand what it means to divide one. I am running my algorithm through another program that checks it, and here is one message I get that confuses me:

Passing the value 0x7fffff to the function returns 0x3fffff. The function is supposed to return 0x400000.

I don't really understand what is going on here, or why the function is supposed to return this value. Can anybody explain why 0x400000 is the expected result?

My initial approach for handling a denormalized number was to right shift the fraction bits by one (dividing by 2), but that doesn't seem to be the desired procedure.

Here is what I have:

unsigned float_half(unsigned uf) {

  unsigned exp = uf & (0x7F800000);
  unsigned sign = uf & (0x80000000);
  unsigned fract = uf & (0x007FFFFF);
  // Check for NaN or infinity
  if(exp == 0x7F800000) {
    return uf;
  }
  // Check for denormalized numbers
  if(exp == 0x00000000) {
    // Need to do something here, not really sure...

    return sign | exp | fract;
  } 
  // Check for exponent of 1 (going to a denormalized number changes things)
  if(exp == 0x00800000) {
    fract = (0x00FFFFFF & uf) >> 1;
    return fract | sign;
  }

  exp--;
  exp = exp & (0x7F800000);
  return sign | exp | fract;
}

Solution

You are probably supposed to round the value in the denormalized case. For 0x7fffff you are cutting off the final 1 bit with the shift. It seems that you are expected to round the value up instead, for example like this:

if(exp == 0x00000000) {
  fract = (0x00FFFFFF & uf) >> 1;
  if (0x00000001 & uf)
    fract++;
  return fract | sign;
}

Whether you are supposed to round up or down might also depend on the sign.

Other tips

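The function is supposed to return 0x400000 because the result must follow round-to-even mode: half of 0x7fffff lies exactly between 0x3fffff and 0x400000, and the tie goes to the value whose last bit is 0. Here is my function: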

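unsigned float_half(unsigned uf){
    unsigned sign = uf & (0x80000000);
    unsigned exp = uf >> 23 & 0xff;
    unsigned frac = uf & 0x7fffff;

    if(exp == 0xff)             // infinity or NaN: return unchanged
        return uf;
    else if (exp > 1)           // result stays normalized: decrement the exponent
        return sign | --exp << 23 | frac;
    else {
        if (exp == 1)           // result becomes denormalized: make the implicit 1 explicit
            frac |= 1 << 23;
        if ((frac & 3) == 3)    // tie that must round up to the even neighbor
            frac++;
        frac >>= 1;
        return sign | frac;
    }
}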
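To see why the test is `(frac & 3) == 3` rather than just "is the low bit set", it may help to compare three ways of halving a denormalized fraction field. This is only a sketch: the helper names `half_truncate`, `half_round_up` and `half_round_even` are made up for illustration and do not come from the answers above.

```c
#include <stdint.h>

/* Each helper halves a denormalized fraction field (the low 23 bits). */

/* Plain shift: drops the low bit, i.e. rounds toward zero. */
uint32_t half_truncate(uint32_t frac) {
    return frac >> 1;
}

/* Round half up: every tie (odd input) goes to the larger neighbor. */
uint32_t half_round_up(uint32_t frac) {
    return (frac >> 1) + (frac & 1);
}

/* Round half to even: a tie goes to the neighbor whose low bit is 0.
   An odd input is a tie, and the truncated result is odd exactly when
   bit 1 of the input is also set, so only inputs ending in binary 11
   need to be bumped before the shift. */
uint32_t half_round_even(uint32_t frac) {
    if ((frac & 3) == 3)
        frac++;
    return frac >> 1;
}
```

For 0x7fffff the three helpers give 0x3fffff, 0x400000 and 0x400000, so round-up and round-to-even agree on the value from the question. They differ on inputs ending in binary 01: for 0x000005 they give 2, 3 and 2, and for 0x000001 they give 0, 1 and 0.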

Another version:

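unsigned float_half(unsigned uf){
    unsigned sign = uf & (0x80000000);
    unsigned exp_frac = uf & 0x7fffffff;

    if (exp_frac >= 0x7f800000)         // infinity or NaN: return unchanged
        return uf;
    else if (exp_frac > 0x00ffffff)     // exponent field >= 2: subtract 1 from it
        return uf + 0xff800000;
    else {
        if ((exp_frac & 3) == 3)        // tie that must round up to the even neighbor
            exp_frac++;
        exp_frac >>= 1;                 // the shift also handles exponent field 1 -> 0
        return sign | exp_frac;
    }
}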
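One way to check a bit-level routine like this is to compare it with the machine's own division. The sketch below makes several assumptions: `float` is IEEE 754 single precision, the default round-to-nearest-even mode is active, and denormals are not flushed to zero. The compact version of `float_half` is repeated (with fixed-width types) so the block compiles on its own; `reference_half` and `count_mismatches` are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Compact float_half, restated with fixed-width types. */
uint32_t float_half(uint32_t uf) {
    uint32_t sign = uf & 0x80000000u;
    uint32_t exp_frac = uf & 0x7fffffffu;

    if (exp_frac >= 0x7f800000u)      /* infinity or NaN: unchanged */
        return uf;
    if (exp_frac > 0x00ffffffu)       /* exponent field >= 2: decrement it */
        return uf - 0x00800000u;
    if ((exp_frac & 3) == 3)          /* tie that must round up to even */
        exp_frac++;
    return sign | (exp_frac >> 1);
}

/* Hypothetical helper: halve the value using hardware float division.
   NaNs are passed through untouched, because the hardware is allowed
   to change a NaN's payload bits. */
uint32_t reference_half(uint32_t bits) {
    float f;
    if ((bits & 0x7fffffffu) > 0x7f800000u)
        return bits;
    memcpy(&f, &bits, sizeof f);      /* reinterpret the bits as a float */
    f = f / 2.0f;
    memcpy(&bits, &f, sizeof bits);   /* and back to raw bits */
    return bits;
}

/* Hypothetical helper: count bit patterns in [first, last) on which the
   two routines disagree. */
uint32_t count_mismatches(uint32_t first, uint32_t last) {
    uint32_t bad = 0;
    for (uint32_t b = first; b < last; b++)
        if (float_half(b) != reference_half(b))
            bad++;
    return bad;
}
```

Calling `count_mismatches(0, 0x01000000)` covers every non-negative input whose exponent field is 0 or 1, which is exactly the range where rounding matters.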

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow