Avoiding numerical overflow when calculating the value AND gradient of the Logistic loss function

Question 1

How about the following approximations:

– For computing L, if s is large, then exp(s) will be much larger than 1:

1 + exp(s) ≅ exp(s)

and consequently

log(1 + exp(s)) ≅ log(exp(s)) = s.

If s is small, then using the Taylor series of exp()

exp(s) ≅ 1 + s

and using the Taylor series of log()

log(1 + exp(s)) ≅ log(2 + s) ≅ log(2) + s / 2.

– For computing dL, for large s

exp(s) ./ (1 + exp(s)) ≅ 1

and for small s

exp(s) ./ (1 + exp(s)) ≅ 1/2 + s / 4.

– The code to compute L could look for example like this:

s = X*beta;
l = log(1+exp(s));
ind = isinf(l);
l(ind) = s(ind);
ind = (l == 0);
l(ind) = log(2) + s(ind) / 2;
L = 1/N * sum(l,1)

Question 2

I found a good article about this problem.

Cutting through a lot of words, we can simplify the argument to stating that the original expression

log(1 + exp(s))

can be rewritten as

log(exp(s)*(exp(-s) + 1))
= log(exp(s)) + log(exp(-s) + 1)
= s + log(exp(-s) + 1)

This stops overflow from occurring - it doesn't prevent underflow, but by the time that occurs, you have your answer (namely, s). You can't just use this instead of the original, since it will still give you problems. However, we now have the basis for a function that can be written that will be accurate and won't produce over/underflow:

function LL = logistic(s)
if s<0
  LL = log(1 + exp(s));
else
  LL = s + logistic(-s);

I think this maintains reasonably good accuracy.

EDIT now to the meat of your question - making this vectorized, and allowing the calculation of the slope as well. Let's take these one at a time:

function LL = logisticVec(s)
  LL = zeros(size(s));
  LL(s<0) = log(1 + exp(s(s<0)));
  LL(s>=0) = s(s>=0) + log(1 + exp(-s(s>=0)));

To obtain the average you wanted:

L = logisticVec(X*beta) / N;

The slope is a little bit trickier; note I believe you may have a typo in your expression (missing a multiplication sign).

dL/dbeta = sum(X * exp(X*beta) ./ (1 + exp(X*beta))) / N;

If we divide top and bottom by exp(X*beta) we get

dL = sum(X ./ (exp(-X*beta) + 1)) / N;

Once again, the overflow has gone away and we are left with underflow - but since the underflowed value has 1 added to it, the error this creates is insignificant.