Question

In R, I can quickly see a count of missing data using the summary command, but the equivalent pandas DataFrame method, describe, does not report these values.

I gather I can do something like

len(mydata.index) - mydata.count()

to compute the number of missing values for each column, but I wonder if there's a better idiom (or if my approach is even right).


Solution

Both describe and info report the count of non-missing values.

In [1]: df = DataFrame(np.random.randn(10,2))

In [2]: df.iloc[3:6,0] = np.nan

In [3]: df
Out[3]: 
          0         1
0 -0.560342  1.862640
1 -1.237742  0.596384
2  0.603539 -1.561594
3       NaN  3.018954
4       NaN -0.046759
5       NaN  0.480158
6  0.113200 -0.911159
7  0.990895  0.612990
8  0.668534 -0.701769
9 -0.607247 -0.489427

[10 rows x 2 columns]

In [4]: df.describe()
Out[4]: 
              0          1
count  7.000000  10.000000
mean  -0.004166   0.286042
std    0.818586   1.363422
min   -1.237742  -1.561594
25%   -0.583795  -0.648684
50%    0.113200   0.216699
75%    0.636036   0.608839
max    0.990895   3.018954

[8 rows x 2 columns]


In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 2 columns):
0    7 non-null float64
1    10 non-null float64
dtypes: float64(2)

To get a count of missing values, your solution is correct:

In [20]: len(df.index)-df.count()
Out[20]: 
0    3
1    0
dtype: int64

You could also do this:

In [23]: df.isnull().sum()
Out[23]: 
0    3
1    0
dtype: int64

OTHER TIPS

As a tiny addition, to get the percentage of missing values per DataFrame column, combining @Jeff's and @userS's answers above gets you:

100*(df.isnull().sum())/len(df)
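
Equivalently, since the column-wise mean of a boolean mask is the fraction of True values, the same percentages can be computed in one step (a small sketch, not part of the original answers):

# isnull() yields a boolean DataFrame; its per-column mean is the fraction of missing values
100 * df.isnull().mean()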

The following will do the trick and return the count of nulls for every column:

df.isnull().sum(axis=0)

df.isnull() returns a DataFrame of True/False values
sum(axis=0) sums those values across all rows for each column
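
The same idea works per row: passing axis=1 sums across the columns instead, giving the number of missing values in each row (a minimal sketch using the same df):

# count of missing values in each row rather than each column
df.isnull().sum(axis=1)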

This isn't quite a full summary, but it will give you a quick sense of your column-level data:

def getPctMissing(series):
    num = series.isnull().sum()   # number of missing values in the column
    den = len(series)             # total number of entries, missing or not
    return 100 * (num / den)
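
For instance, a quick usage sketch applying the helper to every column of the example DataFrame (df.apply calls it once per column and returns a Series indexed by column name):

df.apply(getPctMissing)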

If you want to see the non-null summary of each column, just use df.info(null_counts=True):

Example 1:

df = pd.DataFrame(np.random.randn(10,5), columns=list('abcde'))
df.iloc[:4,0] = np.nan
df.iloc[:3,1] = np.nan
df.iloc[:2,2] = np.nan
df.iloc[:1,3] = np.nan

df.info(null_counts=True)

output:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       6 non-null      float64
 1   b       7 non-null      float64
 2   c       8 non-null      float64
 3   d       9 non-null      float64
 4   e       10 non-null     float64
dtypes: float64(5)
memory usage: 528.0 bytes
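
Note that in more recent pandas releases the null_counts argument was deprecated in favour of show_counts, so on newer versions (roughly pandas 1.2 and later, an assumption about your environment) the equivalent call is:

df.info(show_counts=True)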

In addition, if you want to customize the result, such as adding a nan_rate column, I wrote a method:


def describe_nan(df):
    return pd.DataFrame(
        [(i, df[df[i].isna()].shape[0], df[df[i].isna()].shape[0] / df.shape[0])
         for i in df.columns],
        columns=['column', 'nan_counts', 'nan_rate'])

describe_nan(df)

  column  nan_counts  nan_rate
0      a           4       0.4
1      b           3       0.3
2      c           2       0.2
3      d           1       0.1
4      e           0       0.0
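
A vectorized equivalent (a sketch with a hypothetical helper name, not part of the original answer) builds the same summary without looping over the columns, which scales better on wide frames:

def describe_nan_vectorized(df):
    # isna().sum() gives per-column NaN counts; isna().mean() gives per-column NaN rates
    counts = df.isna().sum()
    rates = df.isna().mean()
    return (pd.DataFrame({'nan_counts': counts, 'nan_rate': rates})
              .rename_axis('column')
              .reset_index())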

If you don't care which columns have NaNs and you just want to check overall, just add a second .sum() to get a single value:

result = df.isnull().sum().sum()
result > 0

A Series would only need one .sum(), and a Panel would need three.
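
If all you need is that boolean check, a common shortcut (not from the original answer) skips the sums entirely:

# True if any value anywhere in the DataFrame is missing
df.isnull().values.any()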

I had to process numerous large datasets to gather NaN information (counts and portions per column), and timing was an issue. So I timed various methods for collecting summary counts of NaNs per column into a separate DataFrame with column names, NaN counts, and NaN portions as columns:

# create random dataframe
dfa = pd.DataFrame(np.random.randn(100000,300))
# add 30% random NaNs
dfa = dfa.mask(np.random.random(dfa.shape) < 0.3)

With pandas methods only:

%%timeit
nans_dfa = dfa.isna().sum().rename_axis('Columns').reset_index(name='Counts')
nans_dfa["NaNportions"] = nans_dfa["Counts"] / dfa.shape[0]

# Output:
# 10 loops, best of 5: 57.8 ms per loop

Using a list comprehension, based on the fine answer from @Mithril:

%%timeit
nan_dfa_loop2 = pd.DataFrame([(col, dfa[dfa[col].isna()].shape[0], dfa[dfa[col].isna()].shape[0]/dfa.shape[0]) for col in dfa.columns], columns=('Columns', 'Counts', 'NaNportions'))

# Output:
# 1 loop, best of 5: 13.9 s per loop

Using a list comprehension with a second for clause to store the result of the method calls, which reduces repeated calls to those methods:

%%timeit
nan_dfa_loop1 = pd.DataFrame([(col, n, n/dfa.shape[0]) for col in dfa.columns for n in (dfa[col].isna().sum(),) if n], columns=('Columns', 'Counts', 'NaNportions'))

# Output:
# 1 loop, best of 5: 373 ms per loop

All the above produce the same DataFrame (note that the if n filter in the last variant skips columns with no NaNs, so it only matches exactly when every column has at least one missing value):

    Columns Counts  NaNportions
0   0   29902   0.29902
1   1   30101   0.30101
2   2   30008   0.30008
3   3   30194   0.30194
4   4   29856   0.29856
... ... ... ...
295 295 29823   0.29823
296 296 29818   0.29818
297 297 29979   0.29979
298 298 30050   0.30050
299 299 30192   0.30192

('Columns' is redundant with this test dataframe. It is just used as a placeholder; in a real-life dataset it would represent the names of the attributes in the initial dataframe.)

A more detailed breakdown, printing the True/False counts for each column that contains missing values:

missed_values = df.isnull()

for col in missed_values.columns.values.tolist():
    if True in missed_values[col].values:
        print(missed_values[col].name, missed_values[col].value_counts())

This code makes your life easy:

import sidetable

df.stb.missing()

Check this out: https://github.com/chris1610/sidetable


I'd recommend using the missingno package (https://github.com/ResidentMario/missingno), which allows you to quickly and easily visualize missing data in a pandas DataFrame. My preferred visualization is a bar chart, but it offers others.

import missingno as msno
import pandas as pd

df = pd.read_csv('some_data.csv')

msno.bar(df.sample(1000))
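
For the other visualizations mentioned above, a short sketch using two more missingno plots (msno.matrix and msno.heatmap are part of the missingno API; the sample size is arbitrary):

msno.matrix(df.sample(250))   # per-row nullity pattern
msno.heatmap(df)              # nullity correlation between columns
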
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow