Calculating variance with large numbers

https://stackoverflow.com/questions/1721980

19-09-2019
|

Question

I haven't really used variance calculation that much, and I don't know quite what to expect. Actually I'm not too good with math at all.

I have a an array of 1000000 random numeric values in the range 0-10000.

The array could grow even larger, so I use 64 bit int for sum.

I have tried to find code on how to calc variance, but I don't know if I get correct output.

The mean is 4692 and median is 4533. I get variance 1483780.469308 using the following code:

// size is the element count, in this case 1000000
// value_sum is __int64

double p2 = pow( (double)(value_sum - (value_sum/size)), (double)2.0 );
double variance = sqrt( (double)(p2 / (size-1)) );

Am I getting a reasonable value?

Is anything wrong with the calculation?

Solution

Note: It doesn't look like you're calculating the variance.

Variance is calculated by subtracting the mean from every element and calculating the weighted sum of these differences.

So what you need to do is:

// Get mean
double mean = static_cast<double>(value_sum)/size;

// Calculate variance
double variance = 0;
for(int i = 0;i<size;++i) 
{
  variance += (MyArray[i]-mean)*(MyArray[i]-mean)/size;
}

// Display
cout<<variance;

Note that this is the sample variance, and is used when the underlying distribution is unknown (so we assume a uniform distribution).

Also, after some digging around, I found that this is not an unbiased estimator. Wolfram Alpha has something to say about this, but as an example, when MATLAB computes the variance, it returns the "bias-corrected sample variance".

The bias-corrected variance can be obtained by dividing by each element by size-1, or:

//Please check that size > 1
variance += (MyArray[i]-mean)*(MyArray[i]-mean)/(size-1);

Also note that, the value of mean remains the same.

OTHER TIPS

First of all, if you're just looking to get a handle on what is a "reasonable" variance, keep in mind that variance is basically standard deviation squared. Standard deviation roughly measures the typical distance from a data point to its expected value.

So if your data has mean 4692, and your calculated variance is coming out to 1483780, that means your standard deviation is about 1218, which would suggest your numbers tend to be somewhere in the vicinity of the range 3474 - 5910. So that variance actually seems a bit low to me if the range of your numbers is 0 - 10000; but it obviously depends on the distribution of your data.

As for the calculation itself: You can calculate the variance using a running calculation as you're reading your data the first time around (you don't have to know the mean in advance) using Welford's Method:

Initialize M1 = x1 and S1 = 0.

For subsequent x's, use the recurrence formulas

Mk = Mk-1+ (xk - Mk-1)/k Sk = Sk-1 + (xk - Mk-1)*(xk - Mk).

For 2 ≤ k ≤ n, the kth estimate of the variance is s2 = Sk/(k - 1).

Just for fun, a slightly different route to the same result, using std::valarray instead of std::vector and (various) algorithms:

template <class T>
T const variance(std::valarray<T> const &v) {
    if (v.size() == 0)
        return T(0.0);
    T average = v.sum() / v.size();
    std::valarray<T> diffs = v-average;
    diffs *= diffs;
    return diffs.sum()/diffs.size();
}

As Jacob hinted, there are really two possible versions of a variance calculation. As it stands, this assumes your inputs are the "universe". If you've taken only a sample of the overall universe, the last line should use: (diffs.size()-1) instead of diffs.size().

Use a different formula maybe?

#include <functional>
#include <algorithm>
#include <iostream>
int main()
{
 using namespace std;

 vector<double> num( 3 );
 num[ 0 ] = 4000.9, num[ 1 ] = 11111.221, num[ 2 ] = -2;


 double mean = std::accumulate(num.begin(), num.end(), 0.0) / num.size();
 vector<double> diff(num.size());
 std::transform(num.begin(), num.end(), diff.begin(), 
                std::bind2nd(std::minus<double>(), mean));
 double variance = std::inner_product(diff.begin(), diff.end(), 
                                     diff.begin(), 0.0) / (num.size() - 1);
 cout << "mean = " << mean << endl
      << "variance = " << variance << endl;
}

Outputs: mean = 5036.71 variance = 3.16806e+07

Sample Variance calculation:

#include <math.h>
#include <vector>

double Variance(std::vector<double>);

int main()
{
     std::vector<double> samples;
     samples.push_back(2.0);
     samples.push_back(3.0);
     samples.push_back(4.0);
     samples.push_back(5.0);
     samples.push_back(6.0);
     samples.push_back(7.0);

     double variance = Variance(samples);
     return 0;
}

double Variance(std::vector<double> samples)
{
     int size = samples.size();

     double variance = 0;
     double t = samples[0];
     for (int i = 1; i < size; i++)
     {
          t += samples[i];
          double diff = ((i + 1) * samples[i]) - t;
          variance += (diff * diff) / ((i + 1.0) *i);
     }

     return variance / (size - 1);
}

Since you're working with large numbers and then doing floating-point operations on them, you might want to do everything in doubles; that would save you a lot of casts.

Using pow .. 2 to calculate a square seems a bit awkward. You could calculate your number first, then multiply it by itself to get a square.

If you're doing division and feel the need to cast, cast the operands (i.e. the numerator and/or denominator) to double rather than the result. You're losing accuracy if you divide integers.

I'm not sure if your formula for variance is correct. You may want to look at the explanation in Wikipedia, for example. But I'm no math expert either, so I'm not sure you have a mistake.

Since variance is the square of the standard deviation, the answers to SO 1174984 should help out. The short diagnosis is that you need to compute the sum of the squares of the values as well as the sum of the values, and you don't seem to be doing that.

Since you have 10⁶ values, and the square of any value can be up to 10⁸, you could end up with a sum of squares up to 10¹⁴; your 64-bit integers can store up to 10¹⁸, so you could still handle ten thousand times as many inputs, or values ranging up to one million instead of only ten thousand, without running into overflows. There's no urgent need, therefore, to move to pure double computations.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow