Question

I want to calculate the variance of each row of numbers in Perl. I've written this subroutine:

################################################################
# variance
#
#
# A subroutine to compute the variance of an array
# division by n-1 is used
#
sub var{
    my ($data) = @_;
    if (@$data ==1) {
        return 0;
    }
    my $mean = mean($data);
    my $sqtotal = 0;
    foreach (@$data) {
        $sqtotal += ($_ - $mean) ** 2;
    }
    my $var = $sqtotal / (scalar @$data - 1);
    return $var;
}
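(The `var` subroutine above calls a `mean` helper that isn't shown in the post; a minimal version, my own sketch, might look like this:)

```perl
# Minimal mean() helper assumed by var() above: sum the elements
# of the array ref and divide by the count. Not in the original post.
sub mean {
    my ($data) = @_;
    my $total = 0;
    $total += $_ for @$data;
    return $total / scalar @$data;
}
```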

If I give it this array of 58 copies of the same number:

[0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98,0.98]

The calculation gave me 1.25421964097639e-30.

I also tried to use the Statistics::Descriptive module (http://metacpan.org/pod/Statistics::Descriptive) and it gave me 2.11916254524942e-15.

And I also tried this site (http://www.alcula.com/calculators/statistics/variance/) and its result is 2.2438191655582E-15.

Why are the results not the same?

I could have just used the module, but it was somehow extremely memory intensive for my file, which consists of about a million lines of 58 numbers each. I'm not sure why it used up so much memory.

Can someone tell me why my calculation gave a different number from the module, and also how to make the module work with less memory? Is the memory usage just an inherent drawback of that module? Several posts seem to suggest so.

Thanks!


Solution

The variance of a constant sequence is zero, so your calculations are all more or less correct, and all more or less the same.

You get results slightly different than zero because you are performing many operations with finite precision floating point numbers. Let's take this code:

 $z = 0;
 $z += 0.98 for 1..58;
 $mean = $z / 58;
 printf "%.20f", $mean;

This code sums 58 instances of the number 0.98 and then divides the sum by 58. You'd expect it to print 0.98000000000000000000, right? No, what I actually get is

 0.97999999999999887201

(and your results might vary).

The canonical What Every Computer Scientist Should Know About Floating-Point Arithmetic can explain the gory details to you.
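One general mitigation (not part of the original answer) is compensated summation. A sketch of Kahan's algorithm in Perl, with the sub name my own:

```perl
# Compensated (Kahan) summation: track the low-order bits lost at
# each addition in a correction term and feed them back in, so the
# final sum is much closer to the exactly rounded result.
sub kahan_sum {
    my @values = @_;
    my ($sum, $c) = (0, 0);
    for my $x (@values) {
        my $y = $x - $c;        # apply the running correction
        my $t = $sum + $y;      # big + small: low bits of $y are lost
        $c = ($t - $sum) - $y;  # recover exactly what was lost
        $sum = $t;
    }
    return $sum;
}

printf "%.20f\n", kahan_sum((0.98) x 58) / 58;
```

This won't make the variance of a constant sequence exactly zero in every case, but it sharply reduces how much error accumulates over long sums.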

OTHER TIPS

Fixed point arithmetic

Do your data use 0.01 (1/100) precision? The example you have provided suggests they do.

YES => you may use fixed-point arithmetic instead of floating-point arithmetic to reduce the accumulation of rounding errors. Since the squared deviations are scaled by 100**2, use a 1/(100**2) = 1/10_000 scaling factor to convert the result back.

https://en.wikipedia.org/wiki/Fixed-point_arithmetic


sub var4{
    my ($data) = @_;
    if (@$data ==1) {
        return 0;
    }
    my $totalD2 = 0;
    foreach (@$data) {
        $totalD2 += $_ * 100;
    }
    my $meanD2 = $totalD2 / (scalar @$data);

    my $sqtotalD4 = 0;
    foreach (@$data) {
        $sqtotalD4 += ($_*100 - $meanD2) ** 2;
    }
    my $varD4 = $sqtotalD4 / (scalar @$data - 1);
    return $varD4 / 10_000; # undo the 100**2 scaling of the squared deviations
}
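A quick sanity check of the idea, inlined rather than calling `var4` so it stands alone (using the constant array from the question):

```perl
# With two-decimal data, $x * 100 rounds to an exact integer
# (0.98 * 100 is exactly 98 as a double), so every subsequent sum
# and difference is exact integer arithmetic.
my @data = (0.98) x 58;

my $total = 0;
$total += $_ * 100 for @data;              # 98 * 58 = 5684, exactly
my $mean = $total / @data;                 # 5684 / 58 = 98, exactly

my $sq = 0;
$sq += ($_ * 100 - $mean) ** 2 for @data;  # (98 - 98)**2 summed = 0, exactly

printf "variance: %g\n", $sq / (@data - 1) / 10_000;  # variance: 0
```

The constant array now yields a variance of exactly 0 instead of 1.25e-30.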

I was waiting for someone to just make the post about floating point arithmetic (ty mob), because that's ultimately the answer.

However, the Statistics::Descriptive module that you linked to was version 2.6 while the current is version 3.0607.

If you look at the source code for 2.6, it uses some stranger math to calculate the mean and the variance:

sub add_data {
  ....

  ##Calculate new mean, pseudo-variance, min and max;
  foreach ( @{ $aref } ) {
    $oldmean = $self->{mean};
    $self->{sum} += $_;
    $self->{count}++;

    ....

    $self->{mean} += ($_ - $oldmean) / $self->{count};
    $self->{pseudo_variance} += ($_ - $oldmean) * ($_ - $self->{mean});
  }

Now this method of calculating the running mean is mathematically accurate and equivalent to Sum(A1..An)/n. However, because of floating-point arithmetic, you're going to see a difference compared to computing the mean straight from the sum/count. Additionally, I suspect that this running variance is mathematically identical as well (I didn't bother doing a paper proof), but you will also see slight differences due to floating-point arithmetic.
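Incidentally, that update rule is Welford's online algorithm; the sample variance falls out as pseudo_variance / (count - 1). A standalone sketch (the sub name is mine, not the module's):

```perl
# Welford's online algorithm: a single pass, no stored data, and
# better numerical stability than the naive sum-of-squares formula.
# Returns the sample variance (n-1 denominator), as var() does.
sub welford_variance {
    my ($data) = @_;
    my ($count, $mean, $m2) = (0, 0, 0);
    for my $x (@$data) {
        $count++;
        my $delta = $x - $mean;
        $mean += $delta / $count;          # running mean
        $m2   += $delta * ($x - $mean);    # running pseudo-variance
    }
    return $count > 1 ? $m2 / ($count - 1) : 0;
}
```

Because it never needs the whole array in memory at once, this is also the natural shape for processing a million-line file row by row.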

The latest version of the module does use the simpler method of calculating the mean and the variance, though, and also has some efficiency improvements from utilizing modules like List::Util. So if your linking to version 2.6 was anything other than an oversight, I suggest that you upgrade to the latest version.
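On the memory question from the post: Statistics::Descriptive::Full keeps every data point it is given, which is what makes it heavy on a million-row file. The Sparse variant in the same distribution keeps only running statistics, so memory stays constant per row. A sketch, assuming Statistics::Descriptive is installed:

```perl
use Statistics::Descriptive;

# Sparse stores only running totals (count, mean, variance, ...),
# never the data itself, so a million rows cost no more memory
# than one row.
my $stat = Statistics::Descriptive::Sparse->new();
$stat->add_data(0.98) for 1 .. 58;
printf "n=%d variance=%g\n", $stat->count(), $stat->variance();
```

You'd create a fresh Sparse object per row (or reuse one via its clear method), rather than loading the whole file into a Full object.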

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow