Question

I noticed a difference in how pandas.DataFrame.describe() and numpy.percentile() handle NaN values. For example:

import numpy as np
import pandas as pd

a = pd.DataFrame(np.random.rand(100000),columns=['A'])

>>> a.describe()           
              A
count  100000.000000
mean        0.499713
std         0.288722
min         0.000009
25%         0.249372
50%         0.498889
75%         0.749249
max         0.999991

>>> np.percentile(a,[25,50,75])
[0.24937217017643742, 0.49888913303316823, 0.74924862428575034] # Same as a.describe()

# Add in NaN values
a.iloc[1:99999:3] = np.nan

>>> a.describe()
                  A
count  66667.000000
mean       0.499698
std        0.288825
min        0.000031
25%        0.249285
50%        0.500110
75%        0.750201
max        0.999991

>>> np.percentile(a,[25,50,75])
[0.37341740173545901, 0.75020053461424419, nan] # Not the same as a.describe()
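
(The specific values above depend on the numpy version: np.percentile sorts the array and NaNs sort to the end, so the lower percentiles are pulled from shifted positions among the valid values, while the higher ones land on NaN. Newer numpy releases instead propagate NaN to every requested percentile and emit a RuntimeWarning:)

>>> np.percentile([1.0, 2.0, 3.0, np.nan], [25, 50, 75])  # behavior on newer numpy
array([nan, nan, nan])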

# Remove NaNs
b = a[pd.notnull(a.A)]

>>> np.percentile(b,[25,50,75])
[0.2492848255776256, 0.50010992119477615, 0.75020053461424419] # Now in agreement with describe()

Pandas neglects NaN values in percentile calculations, while numpy does not. Is there any compelling reason to include NaNs in percentile calculations? It seems pandas handles this correctly, so I wonder why numpy does not offer a similar implementation.
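
(Side note: newer numpy releases do provide a NaN-aware variant, np.nanpercentile, added in numpy 1.9, which ignores NaNs the same way describe() does:)

>>> np.nanpercentile(a['A'], [25, 50, 75])  # requires numpy >= 1.9
array([ 0.24928483,  0.50010992,  0.75020053])  # same values as a.describe()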

Begin Edit

Per Jeff's comment, this becomes an issue when resampling data. If I have a time series that contains NaN values and want to resample it to percentiles (per this post), then

upper = df.resample('1A').apply(lambda x: np.percentile(x, q=75))

will include NaN values in the calculation (as numpy does). To avoid this, you must instead write

upper = df.resample('1A').apply(lambda x: np.percentile(x[pd.notnull(x)], q=75))

Perhaps a numpy feature request is in order. Personally, I do not see any reason to include NaNs in percentile calculations. DataFrame.describe() and np.percentile() should, in my opinion, return exactly the same values (I think that is the expected behavior), but the fact that they do not is easily missed (it is not mentioned in the documentation for np.percentile), which can skew the stats. That is my concern.
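
(With a NaN-aware percentile function, the manual filtering above becomes unnecessary; a minimal sketch, again assuming numpy >= 1.9 so that np.nanpercentile is available:)

upper = df.resample('1A').apply(lambda x: np.nanpercentile(x, q=75))  # skips NaNs within each bin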

End Edit


Solution

For your edited use case, I think I'd stay in pandas and use Series.quantile instead of np.percentile:

>>> df = pd.DataFrame(np.random.rand(100000),columns=['A'], 
...                   index=pd.date_range("Jan 1 2013", freq="H", periods=100000))
>>> df.iloc[1:99999:3] = np.nan
>>> 
>>> upper_np = df.resample('1A').apply(lambda x: np.percentile(x, q=75))
>>> upper_np.describe()
        A
count   0
mean  NaN
std   NaN
min   NaN
25%   NaN
50%   NaN
75%   NaN
max   NaN

[8 rows x 1 columns]
>>> upper_pd = df.resample('1A').apply(lambda x: x.quantile(0.75))
>>> upper_pd.describe()
               A
count  12.000000
mean    0.745648
std     0.004889
min     0.735160
25%     0.744723
50%     0.747492
75%     0.748965
max     0.750341

[8 rows x 1 columns]
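
(On recent pandas the resampler also exposes quantile directly, which skips NaNs the same way Series.quantile does, so the lambda can be dropped entirely:)

>>> upper_pd = df.resample('1A').quantile(0.75)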
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow