Question

I'm currently using an online variance algorithm to calculate the variance of a given sequence. It works nicely, with good numerical stability and overflow resistance, at the cost of some speed, which is fine. My question: if the sample mean is already known, is there a faster algorithm with similar stability and overflow resistance (i.e., not a naive variance calculation)?

The current online variance calculation is a single-pass algorithm with both a division and multiplications in the main loop (which is what hurts speed). From Wikipedia:

def online_variance(data):
    # Welford's online algorithm: single pass, numerically stable.
    n = 0
    mean = 0
    M2 = 0  # running sum of squared deviations from the current mean

    for x in data:
        n = n + 1
        delta = x - mean
        mean = mean + delta/n          # division in the inner loop
        M2 = M2 + delta*(x - mean)

    variance = M2/(n - 1)  # sample (n - 1) variance
    return variance

Solution

What makes a naive variance calculation unstable is that you separately sum the x values (to get mean(x)) and the x^2 values, and then take the difference:

var = mean(x^2) - (mean(x))^2
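To see the cancellation concretely, here is a small illustrative sketch (not part of the answer's algorithm): with a mean around 1e9, the two sums agree in most of their significant digits, and the subtraction wipes those digits out.

```python
def naive_variance(data):
    # Naive formula: var = (sum(x^2) - sum(x)^2 / n) / (n - 1).
    # Both sums are huge and nearly equal, so the subtraction
    # cancels most of the significant digits.
    n = len(data)
    s = sum(data)
    s2 = sum(x * x for x in data)
    return (s2 - s * s / n) / (n - 1)

# Shift a small dataset by a large offset; the true sample
# variance is unchanged (30.0), but the naive result is badly
# off (it can even come out negative).
data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]
print(naive_variance(data))
```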

But since the definition of variance is

var = mean((x - mean(x))^2)

you can just evaluate that directly, and it will be as fast as it can be. When you don't know the mean, you have to compute it first for stability, or use the "naive" formulation that goes through the data only once at the expense of numerical stability.

EDIT: Now that you have given the "original" code, it's easy to do better (faster). As you correctly point out, the division in the inner loop is what slows you down. Try this one for comparison:

def newVariance(data, mean):
    # Known-mean variant: no division inside the loop.
    n = 0
    M2 = 0

    for x in data:
        n = n + 1
        delta = x - mean
        M2 = M2 + delta * delta  # just a subtraction and a multiply per element

    variance = M2 / (n - 1)
    return variance

Note: this looks a lot like the two_pass_variance algorithm from Wikipedia, except that you don't need the first pass to compute the mean, since you say it is already known.
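A quick sanity check (a sketch in plain Python, with both functions repeated so it runs standalone): when the supplied mean is exact, the known-mean variant and the original online algorithm agree on the sample variance.

```python
def online_variance(data):
    # Welford's single-pass algorithm (from the question).
    n = 0
    mean = 0.0
    M2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)
    return M2 / (n - 1)

def newVariance(data, mean):
    # Known-mean variant: no division inside the loop.
    n = 0
    M2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        M2 += delta * delta
    return M2 / (n - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = sum(data) / len(data)  # 5.0, exact for this data
print(online_variance(data))   # both agree to machine precision
print(newVariance(data, mean))
```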

Licensed under: CC-BY-SA with attribution