Question

I am trying to optimize the QTC video codec to work on the Raspberry Pi with a decent performance. One important bottleneck is a 32 bits integer division done in the range decoder which takes into account for 18% of the decoding time. Since the device's ARM processor apparently lacks an integer division instruction, I think one could easily optimize this. The division has to be accurate.

Both the dividend and the divisor in that particular division are different each call, but it is known that the divisor is always smaller than 65536. I thought about building a lookup table of inverse divisor values. Using that table I could use a multiplication instead of the division. The lookup table is going to have a size of 256 kibibytes.

Questions

  1. Is it a good idea to perform that optimization?
  2. Are there better ways to get rid of the software division?
  3. Is there a different way to implement the algorithm such that there is no division?
  4. Other ideas?
Was it helpful?

Solution

If you want to use a magic multiplication + LUT, here's some code.

Simple tester testing random divisors i. Did not exhaustively test all i's, but has worked for the short period I ran it. Seems to work on all 32-bit states of the dividend (j=0..2^32-1) for the i's I tested.

In reality, you would precompute a lookup table for i=2..64k-1 or some range like that (i=0 wont work because value/0 is undefined and i=1 wont work because the magic multiplier for that is just outside the range of a 32-bit number). Then use the equation using i as your lookup index to get the magic multiplier 'm'. Change as needed & don't hate on my style. :P

#include <stdio.h>

int main() {
  unsigned int i,j,k,m,c;

  // compute j/i,
  // compute k = 2^32/i
  // instead of j/i, use m = ~(j*k)>>32
  srand(time(0));
  for(c=0;c<64;c++) {
    // generate random divisor i's for testing, then fully test every j
    i = rand()&0x7fff;      
    // precompute these and put into a lookup table, index on [i]
    k = (((__int64)1)<<32)/i;  
    for(j=0;j!=-1;j++) {
      // status updater so we know it's working...
      if(!(j&0xfffff)) { printf("%d : %d     \r", i, j); fflush(0); }    
      // multiply instead of divide!
      m = (((__int64)j*k)+k/2)>>32; 
      // rare fixup
      if(j - m*i >= i) m++;                          
      if(m != j/i) {
        // as long as this line doesn't print, we're ok
        printf("wrong : %d %d %d   got: %d  should be: %d\n", 
            i, j, k, m, j/i);    
      }
    }
  }
}

OTHER TIPS

One can also take advantage of the fact, that the Raspberry Pi includes a FP unit that is capable of performing a double precision FP division, which is faster than the software emulation of the integer division. Replacing all integer divisions a = b / c with a = (double)b / (double)c works for me.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top