Question

I have trouble understanding how pandas and/or numpy handle NaN values. I am extracting subsets of a pandas DataFrame in order to compute t-statistics. Apologies for not making this a working example: I don't know how to recreate the NaN values that show up in my DataFrame. The original data is read in using read_csv, with the CSV denoting missing values as NA. For example, I want to know whether the mean of x2 differs significantly between the group whose x1 value is A and the group whose x1 value is B:

import numpy as np
import pandas as pd
import scipy.stats as st
A = data[data['x1']=='A']['x2']
B = data[data['x1']=='B'].x2
A

2      3
3      1
5      2
6      3
10     3
12     2
15     2
16     0
21     0
24     1
25     1
28   NaN
31     0
32     3
...
677     0
681   NaN
682     3
683     1
686     2
Name: praxiserf, Length: 335, dtype: float64

That is, I have two pandas.core.series.Series objects, which I then want to perform a t-test on. However, using

st.ttest_ind(A, B)

returns:

(array(nan), nan)

I presume this has to do with the fact that ttest_ind accepts arrays as inputs and there seems to be a problem with my NaN values when converting the series to an array. If I try to calculate means of the original series, I get:

A.mean(), B.mean()

1.5802, 1.2

However, when I try to turn the series into an array, I get:

A_array = np.asarray(A)
A_array

array([ 3., 1., 2., 3., 3., 2., 2., 0., 0., 1., 1.,
        nan, 0., 3., ..., 1., nan, 0., 3. ])

That is, NaN turned into nan and taking the mean of the array doesn't work anymore:

A_array.mean()

nan

How should the missing values be treated in order to ensure that I can still do calculations with the series/array?
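
A minimal sketch of how the situation above can be reproduced, assuming a small made-up CSV (the values below are hypothetical, not the original data). It relies on read_csv treating the string NA as missing by default:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV text; "NA" marks missing values, as in the question.
csv_text = """x1,x2
A,3
A,NA
A,1
B,2
B,NA
B,0
"""

data = pd.read_csv(io.StringIO(csv_text))  # "NA" is parsed as NaN by default

A = data.loc[data["x1"] == "A", "x2"]
A_array = np.asarray(A)

series_mean = A.mean()                 # pandas skips NaN by default (skipna=True)
array_mean = A_array.mean()            # NumPy propagates NaN, so this is nan
nan_aware_mean = np.nanmean(A_array)   # NumPy's NaN-ignoring variant

print(series_mean, array_mean, nan_aware_mean)
```

The NaN values themselves are the same in the Series and in the array; only the default behaviour of the two mean implementations differs.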


Solution

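pandas reductions such as Series.mean skip NaN by default (skipna=True); I believe it uses the same code as the bottleneck nanmean function when bottleneck is installed. numpy doesn't do that for you: np.mean propagates NaN, and you have to call np.nanmean to ignore them. What you really want to do, however, is mask the NaN values in each series and pass the filtered series to the t-test. Because A and B are independent samples with different indices and lengths, build one mask per series rather than a single combined mask:

st.ttest_ind(A[np.isfinite(A)], B[np.isfinite(B)])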
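
As a rough illustration of the per-series masking idea, here is a minimal sketch. The two Series are made-up data of different lengths, not the data from the question:

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Made-up samples of different lengths, each with missing values.
A = pd.Series([3.0, 1.0, np.nan, 2.0, 3.0, 0.0])
B = pd.Series([1.0, np.nan, 0.0, 2.0])

# Filter each sample on its own: the groups are independent,
# so their NaN positions do not need to line up.
A_clean = A[np.isfinite(A)]   # for this data, the same as A.dropna()
B_clean = B[np.isfinite(B)]

t_stat, p_value = st.ttest_ind(A_clean, B_clean)
print(t_stat, p_value)
```

A combined mask over both series would only make sense for paired observations of equal length, which is not the case for an independent two-sample test.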

Other tips

ttest_ind takes a parameter called "nan_policy" that dictates how nans are treated. By default nan_policy is "propagate" which results in nan if any values in the input are nan. "raise" will raise an error if any inputs are nan. "omit" ignores nan.

st.ttest_ind(A, B, nan_policy="omit")

should give you a non-nan result.
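
A small sketch comparing the policies, using made-up arrays (the values are assumptions for illustration only). With "omit", the result should match dropping the NaN values by hand:

```python
import numpy as np
import scipy.stats as st

# Made-up samples, each containing one NaN.
A = np.array([3.0, 1.0, np.nan, 2.0, 3.0, 0.0])
B = np.array([1.0, np.nan, 0.0, 2.0])

propagated = st.ttest_ind(A, B)                     # default: nan_policy="propagate"
omitted = st.ttest_ind(A, B, nan_policy="omit")     # NaN values are ignored
dropped = st.ttest_ind(A[~np.isnan(A)], B[~np.isnan(B)])  # manual equivalent

print(propagated)
print(omitted)
print(dropped)
```

Passing nan_policy="raise" instead would raise an error for these inputs, which can be useful when missing values are not expected at all.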

Licensed under: CC-BY-SA with attribution