Replace floating point math with integer in sigmoid transfer function

Question 1

A fixed-point math library is what you need. My preferred solution for this is Anthony Williams' fixed-point math C++ library. Because it is in C++ and defines a fixed class with extensive function and operator overloading, it can largely be used simply by replacing float or double in your existing code with fixed. It uses int64_t as the underlying integer data type, with 34 integer bits and 28 fractional bits (34Q28), so it is good for about 8 decimal places and a wider range than int32_t.

If your compiler supports C++, you can still write your code in what is essentially the C subset if you prefer, using C++ only to support this library.

On 32-bit ARM this library performs about 5 times faster than software floating point and is comparable in performance to ARM's VFP unit for C code.

Note that the sqrt() function in this library has poor precision for very small values, as it loses lower-order bits in intermediate calculations that could be preserved. It can be improved by replacing it with the version I presented in this question.

There are no doubt C libraries for fixed-point math, but they will lack the simplicity and convenience of the "real" fixed-point data type provided by this library. This library also has a complete set of standard library math function equivalents, while many fixed-point solutions provide only basic arithmetic operators.
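To illustrate what such a library does under the hood, here is a minimal C sketch of a fixed-point type with 28 fractional bits stored in an int64_t. The names fix_t, FIX_ONE, fix_mul and fix_div are made up for this example and are not the library's API. The sketch also ignores overflow of the 64-bit intermediate product, which a real library has to handle.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 28                       /* fractional bits, as in 34Q28 */
typedef int64_t fix_t;                     /* hypothetical fixed-point type */
#define FIX_ONE ((fix_t)1 << FRAC_BITS)    /* the value 1.0 */

/* Convert from double, rounding to the nearest representable value. */
static fix_t fix_from_double(double d)
{
    return (fix_t)(d * (double)FIX_ONE + (d >= 0 ? 0.5 : -0.5));
}

static double fix_to_double(fix_t f)
{
    return (double)f / (double)FIX_ONE;
}

/* The raw product carries 56 fractional bits; divide once by 1.0 to get
   back to 28. Overflows if |a * b| does not fit in 63 bits. */
static fix_t fix_mul(fix_t a, fix_t b)
{
    return (a * b) / FIX_ONE;
}

/* Pre-scale the dividend so the quotient keeps 28 fractional bits. */
static fix_t fix_div(fix_t a, fix_t b)
{
    assert(b != 0);
    return (a * FIX_ONE) / b;
}
```

With these helpers, fix_mul(fix_from_double(1.5), fix_from_double(2.0)) gives the same bit pattern as fix_from_double(3.0). Addition and subtraction need no helper at all, since the plain integer operators already do the right thing.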

Question 2

Look at the slowest/hardest part:

fVal/sqrt(1 + fVal**2)

This is all you need to think about.

http://www.wolframalpha.com/input/?i=x%2Fsqrt%281+%2B+x%5E2%29

It's obvious that your fVal is less than or equal to 1.

You are after an approximation in the range x = 0 to x = 1 so something like this: http://www.wolframalpha.com/input/?i=expand+x%2Fsqrt%281+%2B+x%5E2%29+around+x+%3D+0.5

That will likely be enough for your needs. Press the more terms button once to get more accuracy.
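As a rough sketch of how such an expansion can be evaluated with integers only, the following C function uses the first three Taylor terms of x/sqrt(1 + x^2) around x = 0.5. The Q15 scaling (32768 means 1.0), the function name and the rounded coefficients are my own choices for illustration; I derived the coefficients from the derivatives of the function at 0.5, so check them against the Wolfram Alpha output before relying on them.

```c
#include <stdint.h>

#define Q15_ONE 32768   /* 1.0 in Q15 (assumed scaling for this sketch) */

/* Taylor coefficients of x/sqrt(1+x^2) around x = 0.5, scaled by 2^15:
     f(0.5)     =  0.4472136  ->  14654
     f'(0.5)    =  0.7155418  ->  23447
     f''(0.5)/2 = -0.4293251  -> -14068
   The cubic coefficient happens to be zero at x = 0.5. */
static int32_t sig_poly_q15(int32_t x)   /* x in Q15, 0..32768 */
{
    const int32_t c0 = 14654, c1 = 23447, c2 = -14068;
    int32_t d = x - Q15_ONE / 2;              /* x - 0.5 */
    int32_t inner = c1 + (c2 * d) / Q15_ONE;  /* c1 + c2*d, Horner form */
    return c0 + (d * inner) / Q15_ONE;        /* c0 + d*(c1 + c2*d) */
}
```

All intermediate products stay below 2^31 for inputs in 0..32768, so 32-bit arithmetic is enough. With only these three terms the result at x = 0 is about -0.018 rather than 0, which is the kind of error that adding more terms reduces.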

To make integers behave like floating point numbers, you can use a simple multiplier scheme, like int = float*10000, but this creates problems when you need the fifth power: you will get overflow. It is better to scale everything so all numbers are less than 1, then use a fractional integer math library to multiply your numbers.

One simple fractional library I built used LONG_MAX to mean 1.0 (about 9 decimal places of accuracy), then to multiply two of these together (so that LONG_MAX*LONG_MAX = LONG_MAX) I used two lines of assembler. You may have access to a fractional math library in your system.

So basically, scale everything so that the max you have on the way in is 1.0.
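A fractional multiply of this kind can be sketched in portable C by using a 64-bit intermediate instead of assembler. This is not the library described above, just an illustration that assumes a 32-bit fraction type (here called frac_t, a made-up name) where INT32_MAX stands for just under 1.0.

```c
#include <stdint.h>

typedef int32_t frac_t;   /* hypothetical type: INT32_MAX is just under 1.0 */

/* Multiply two fractions. The 64-bit product has 62 fractional bits, so
   dividing by 2^31 brings it back to 31. The one case that does not fit
   is INT32_MIN * INT32_MIN, i.e. (-1.0) * (-1.0). */
static frac_t frac_mul(frac_t a, frac_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    return (frac_t)(wide / 2147483648LL);
}
```

Because every operand is below 1.0 in magnitude, the product is too, so repeated multiplication (such as a fifth power) cannot overflow.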

When you are done, it's fairly easy to test this function by running a million or so values through it and comparing the results to the floating point version.
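Such a comparison could look roughly like the following C sketch. The function-pointer types, max_abs_error and the two square_* demo functions are names invented for this example; you would pass your own integer sigmoid and its floating point original instead.

```c
#include <stdint.h>

typedef int32_t (*fixed_fn)(int32_t);   /* scaled input -> scaled output */
typedef double  (*float_fn)(double);    /* floating point reference */

/* Largest absolute difference between the two implementations over the
   scaled inputs lo, lo+1, ..., hi-1. */
static double max_abs_error(fixed_fn approx, float_fn exact,
                            int32_t scale, int32_t lo, int32_t hi)
{
    double worst = 0.0;
    for (int32_t i = lo; i < hi; ++i) {
        double got  = (double)approx(i) / scale;
        double want = exact((double)i / scale);
        double err  = got > want ? got - want : want - got;
        if (err > worst)
            worst = err;
    }
    return worst;
}

/* Demo pair: x*x in fixed point (scale 1000) and in floating point. */
static int32_t square_fixed(int32_t x)
{
    return (int32_t)(((int64_t)x * x) / 1000);
}

static double square_float(double x)
{
    return x * x;
}
```

Calling max_abs_error(square_fixed, square_float, 1000, -1000, 1001) reports the worst truncation error of the demo pair, which stays below one unit of the scale (0.001).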

See http://gameprogrammer.com/4-fixed.html and similar pages for how to work with fixed point.

Question 3

One easy thing you can try, which is probably not good enough for you, but fairly simple:

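 #include <stdlib.h> /* abs */

 static const int scale = 1000; /* three digits after the decimal point */

 /* Fixed-point square root: takes and returns values scaled by `scale`. */
 int sqrt_i(int value);

 static inline int smaller_i(int value, int bias) {
     return value < bias ? value : bias;
 }

 static inline int pow2_i(int iVal) {
     return (iVal * iVal) / scale;
 }

 /* iX and the result are scaled by `scale`; Slope is a plain integer. */
 int sigm_i(int iX, int Slope) {
     /* 179*scale + 9*(scale/10) is 179.9 in scaled form */
     int iVal = (180*scale - smaller_i(abs(iX) * Slope, 179*scale + 9*(scale/10))) / 180;
     /* multiply by scale before dividing so the quotient stays scaled */
     return (iVal * scale) / sqrt_i(1*scale + pow2_i(iVal));
 }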

If you have 64-bit integers, this may be enough for you. If you have only 32 bits, I'm not sure. With only 16 bits, these computations will likely overflow, so you need something a bit more complicated.

Also note that you need to write sqrt_i for yourself.
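One possible sqrt_i, as a sketch only: it assumes that both the argument and the result are scaled by scale, so the scaled root of v is the integer square root of v * scale. The helper isqrt_u64 is a name I made up; it uses the usual bit-by-bit integer square root and a 64-bit intermediate.

```c
#include <stdint.h>

static const int scale = 1000;   /* same scale as assumed in the code above */

/* Floor of the square root of n, computed one result bit at a time. */
static uint32_t isqrt_u64(uint64_t n)
{
    uint64_t root = 0;
    uint64_t bit = (uint64_t)1 << 62;   /* highest power of four in 64 bits */

    while (bit > n)
        bit >>= 2;
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return (uint32_t)root;
}

/* sqrt(value / scale) * scale == sqrt(value * scale) */
int sqrt_i(int value)
{
    if (value <= 0)
        return 0;
    return (int)isqrt_u64((uint64_t)value * (uint64_t)scale);
}
```

With scale = 1000, sqrt_i(4000) returns 2000 (the square root of 4.0 is 2.0) and sqrt_i(2000) returns 1414.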

Question 4

The bottleneck is probably fVal / sqrt(1.f + pow2_f(fVal) ).

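Try using the Fast Inverse Square Root procedure, which yields a good approximation of 1.0 / sqrt(x) (within about 0.2% after one Newton-Raphson step) by manipulating the bits of the float as an integer. Note that the refinement step still uses a few floating point multiplications.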
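For reference, a common formulation of that procedure in C looks like the sketch below. It assumes 32-bit IEEE 754 floats and uses memcpy for the bit reinterpretation instead of pointer casts; the function names are my own, and sigmoid_tail is only an illustration of how the bottleneck expression could use it.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for x > 0, assuming 32-bit IEEE 754 floats. */
static float fast_rsqrt(float x)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);      /* read the float's bit pattern */
    bits = 0x5f3759dfu - (bits >> 1);    /* initial guess via the magic constant */
    memcpy(&y, &bits, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y);   /* one Newton-Raphson refinement */
    return y;
}

/* fVal / sqrt(1 + fVal^2), with the division replaced by a multiplication. */
static float sigmoid_tail(float fVal)
{
    return fVal * fast_rsqrt(1.0f + fVal * fVal);
}
```

For example, fast_rsqrt(4.0f) comes out slightly below the exact 0.5. Whether this beats the hardware or library sqrt depends on the target, so it is worth measuring.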

Question 5

I had this issue for a neural network I wanted to implement on a Raspberry Pi 3 (weights between -127 and 127), and the fastest method I found was a binary search implemented as nested if statements; obviously the if statements needed to be autogenerated and Python came to the rescue.

code

Given a C function:

static
uint16_t sigmoid_lookup(int32_t i) {
#include "autogen_sigmoid_index.i"
}

and a sorted Python list of (sigmoid_value, at_argument), this function creates the if-else tree:

def produce_c_code(numbers_list, idxs, idxe, level):
    if idxs >= idxe:
        raise RuntimeError("idxs=%d idxe=%d" % (idxs, idxe))

    indent= " "*level

    if idxs + 1 == idxe: # end of recursion
        yield indent + "return %d;" % numbers_list[idxs][0]
    else:
        idxm= (idxe+idxs)//2
        yield indent + "if(i>=%d)" % numbers_list[idxm][1]
        yield from produce_c_code(numbers_list, idxm, idxe, level+1)
        yield indent + "else"
        yield from produce_c_code(numbers_list, idxs, idxm, level+1)

example

For this number list: [(0, 0), (1, 9), (2, 25), (3, 41), (4, 57), (5, 73), (6, 89)], the code produced is:

 if(i>=41)
  if(i>=73)
   if(i>=89)
    return 6;
   else
    return 5;
  else
   if(i>=57)
    return 4;
   else
    return 3;
 else
  if(i>=9)
   if(i>=25)
    return 2;
   else
    return 1;
  else
   return 0;
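For comparison, the same decisions can be made at run time with a binary search over a table. This is not the approach benchmarked here, just a sketch that mirrors the recursion of the generator using the example list; the array and function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* From the example list: values[k] applies from thresholds[k] upward. */
static const int32_t  thresholds[] = { 0, 9, 25, 41, 57, 73, 89 };
static const uint16_t values[]     = { 0, 1, 2, 3, 4, 5, 6 };

static uint16_t sigmoid_lookup_table(int32_t i)
{
    size_t lo = 0;                                         /* inclusive */
    size_t hi = sizeof thresholds / sizeof thresholds[0];  /* exclusive */

    /* Same split as the generator: the middle element decides which half. */
    while (lo + 1 < hi) {
        size_t mid = (lo + hi) / 2;
        if (i >= thresholds[mid])
            lo = mid;
        else
            hi = mid;
    }
    return values[lo];
}
```

It returns the same results as the generated tree above, for example 2 for i = 40 and 3 for i = 41. The generated if-else version avoids the loop and the memory reads, at the cost of code size.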

benchmarks

The benchmarks are based on the 127 * n / sqrt(n*n + 4194304) sigmoid function of my case, and they are over the input range [-8000000, 8000000].

Pentium M 1.2 GHz

testing lookup CPU time: 300000
testing math CPU time: 1460000

Raspberry Pi 2 800 MHz

testing lookup CPU time: 474094
testing math CPU time: 2897385

Raspberry Pi 3 1.2 GHz

testing lookup CPU time: 369665
testing math CPU time: 1570066

Intel Core™2 Q6600 2.4 GHz

testing lookup CPU time: 73623
testing math CPU time: 797847
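For completeness, the sigmoid used in these benchmarks, 127 * n / sqrt(n*n + 4194304), can also be evaluated with integers only. The sketch below is not the code that produced the timings above; it is one possible integer version, with function names of my own, using Newton's method for the integer square root.

```c
#include <stdint.h>

/* Floor of the square root of n, using Newton's iteration on integers. */
static uint64_t isqrt64(uint64_t n)
{
    if (n < 2)
        return n;
    uint64_t x = n;
    uint64_t y = (x + 1) / 2;
    while (y < x) {              /* iterates downward until it stops shrinking */
        x = y;
        y = (x + n / x) / 2;
    }
    return x;
}

/* 127 * n / sqrt(n*n + 4194304), truncated toward zero.
   Intended for |n| up to about 8000000, where n*n fits easily in 64 bits. */
static int32_t sigmoid_int(int32_t n)
{
    int64_t sum = (int64_t)n * n + 4194304;
    return (int32_t)((127 * (int64_t)n) / (int64_t)isqrt64((uint64_t)sum));
}
```

Because the square root is rounded down, the result can be one step larger in magnitude than the exact value; for instance, sigmoid_int(8000000) returns 127 although the exact value is just under 127.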