Question

I know about the basic data types, and that floating-point types (float, double) cannot hold some numbers exactly.

While porting some code from Matlab to Python (NumPy), however, I found some significant differences in the calculations, and I think it comes down to precision.

Take the following code, which z-normalizes a 500-dimensional vector in which only the first two elements have a non-zero value.

Matlab:

Z = repmat(0,500,1); Z(1)=3;Z(2)=1;
Za = (Z-repmat(mean(Z),500,1)) ./ repmat(std(Z),500,1);
Za(1)
>>> 21.1694

Python:

from numpy import zeros,mean,std
Z = zeros((500,))
Z[0] = 3
Z[1] = 1
Za = (Z - mean(Z)) / std(Z)
print(Za[0])
>>> 21.1905669677

Besides the fact that Python prints a few more digits, there is a huge difference (imho) of more than 0.02.

Both Python and Matlab use a 64-bit data type (afaik): Python uses numpy.float64 and Matlab uses double.

Why is the difference so huge? Which one is more correct?

Was it helpful?

Solution

Maybe the difference comes from the mean and std calls. Compare those first.

There are several definitions of std; some use the square root of

1 / n * sum((xi - mean(x)) ** 2)

others use

1 / (n - 1) * sum((xi - mean(x)) ** 2)

instead.
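To make the two definitions concrete, here is a short NumPy sketch (using the 500-element vector from the question) that computes both by hand and checks them against numpy.std with the corresponding ddof values:

```python
import numpy as np

# The vector from the question: 500 elements, only the first two non-zero.
x = np.zeros(500)
x[0], x[1] = 3, 1
n = x.size
m = x.mean()

# First definition: divide by n (NumPy's default, ddof=0).
s_n = np.sqrt(np.sum((x - m) ** 2) / n)
# Second definition: divide by n - 1 (MATLAB's default).
s_n1 = np.sqrt(np.sum((x - m) ** 2) / (n - 1))

print(s_n, np.std(x))           # identical
print(s_n1, np.std(x, ddof=1))  # identical
```

The n - 1 variant is always slightly larger, which is exactly why the normalized values came out smaller in MATLAB.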

From a mathematical point of view: these formulas are estimators for the variance of a normally distributed random variable. The distribution has two parameters, sigma and mu. If you know mu exactly, the optimal estimator for sigma ** 2 is

1 / n * sum((xi - mu) ** 2)

If you have to estimate mu from the data as mu = mean(x), the optimal estimator for sigma ** 2 is

1 / (n - 1) * sum((xi - mean(x)) ** 2)
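The n - 1 divisor is what makes the second estimator unbiased when mu itself is estimated from the data. A quick sanity check (my own illustration, not part of the original answer): take a toy distribution, uniform on {0, 2}, which has sigma ** 2 = 1, enumerate every equally likely sample of size 3, and average each estimator over all samples:

```python
import itertools
import numpy as np

# Toy distribution: X uniform on {0, 2}  ->  mu = 1, sigma**2 = 1.
samples = list(itertools.product([0.0, 2.0], repeat=3))  # all 8 equally likely samples

# Average each variance estimator over every possible sample.
avg_biased = np.mean([np.var(s) for s in samples])            # divides by n
avg_unbiased = np.mean([np.var(s, ddof=1) for s in samples])  # divides by n - 1

print(avg_biased)    # 2/3: systematically underestimates sigma**2
print(avg_unbiased)  # 1.0: matches sigma**2 on average
```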

Other tips

To answer your question: no, this is not a problem of precision. As @rocksportrocker points out, there are two popular estimators for the standard deviation. MATLAB's std has both available, but by default it uses a different one from the one you used in Python.

Try std(Z,1) instead of std(Z):

Za = (Z-repmat(mean(Z),500,1)) ./ repmat(std(Z,1),500,1);
sprintf('%1.10f', Za(1))

leads to

Za(1) = 21.1905669677

in MATLAB. Read rocksportrocker's answer about which of the two results is more appropriate for what you want to do ;-).

According to the NumPy documentation for std, it has a parameter called ddof:

ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.

In NumPy, ddof is zero by default, while in MATLAB it is one. So, I think this should solve the problem:

std(Z,ddof=1)
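Putting it together, here is a sketch of the question's example in Python with both ddof settings; the expected values are the ones printed in the question and in the MATLAB session above:

```python
import numpy as np

Z = np.zeros(500)
Z[0], Z[1] = 3, 1

# NumPy default (ddof=0): divides by N -> the original Python result.
za_py = (Z[0] - Z.mean()) / Z.std()
# ddof=1: divides by N - 1 -> matches MATLAB's default std.
za_matlab = (Z[0] - Z.mean()) / Z.std(ddof=1)

print(round(za_py, 4))      # 21.1906
print(round(za_matlab, 4))  # 21.1694
```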
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow