Question

Basically, I would like to find out whether the missing values in a data set occur consecutively. If they do, I would also like to know whether the length of each run of consecutive missing values exceeds a certain number.

For example:

data =['1', '0', '9', '31', '11', '12', 'nan', '10', '44', '53', '12', '66', '99', '3', '2', '6.75833',....., 'nan', 'nan', 'nan', '3', '7', 'nan', 'nan']

In the data above, the total number of 'nan' values would be 6, which can be calculated with data.count('nan'). However, what I want to know is the length of the longest run of consecutive missing values. For this data, the answer would be 3.

I apologize for not showing any example code, but I am a novice in this area and had no idea how to start coding this.

Any idea, help or tips would be really appreciated.


Solution

This looks like a job for itertools.groupby():

>>> from itertools import groupby
>>> data = ['1', '0', '9', '31', '11', '12', 'nan', '10', '44', '53',
...         '12', '66', '99', '3', '2', '6.75833', 'nan', 'nan', 'nan',
...         '3', '7', 'nan', 'nan']
>>> [len(list(group)) for key, group in groupby(data) if key == 'nan']
[1, 3, 2]

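Note: if your data actually contains real NaNs (floats) instead of strings, the if key == 'nan' equality test will not work, because NaN never compares equal to anything, itself included. Replacing the test with math.isnan(key) is not enough on its own, since groupby may then put each NaN in its own group. Group on math.isnan instead, i.e. groupby(data, key=math.isnan), and keep the groups whose key is True.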
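To answer the original question directly (the longest run, and whether any run exceeds a given length), the groupby idea can be wrapped in a small helper. This is only a minimal sketch and not part of the answer above: the name nan_run_lengths, the shortened sample list and the threshold of 2 are made up for illustration, and it assumes the strings have been converted to real floats first.

```python
import math
from itertools import groupby


def nan_run_lengths(values):
    """Return the length of each run of consecutive NaNs, in order."""
    # Grouping by math.isnan (not by the value itself) keeps adjacent NaNs
    # in one group even though NaN != NaN.
    return [
        sum(1 for _ in group)
        for is_missing, group in groupby(values, key=math.isnan)
        if is_missing
    ]


# Shortened, hypothetical sample in the same shape as the question's data.
raw = ['1', '0', 'nan', '10', 'nan', 'nan', 'nan', '3', '7', 'nan', 'nan']
values = [float(s) for s in raw]  # float('nan') yields a real NaN

runs = nan_run_lengths(values)
longest = max(runs, default=0)  # default=0 covers data without any NaN
threshold = 2  # hypothetical limit on the allowed run length
too_long = [n for n in runs if n > threshold]

print(runs)      # [1, 3, 2]
print(longest)   # 3
print(too_long)  # [3]
```

If only a yes/no answer is needed, any(n > threshold for n in runs) is enough.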

OTHER TIPS

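Or you can try this one, which counts each group without building an intermediate list and may be faster. Here L is the input list; add if k == 'nan' at the end of the comprehension to keep only the runs of missing values:

grouped_L = [sum(1 for _ in group) for k, group in groupby(L)]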

Using pyrle for speed. In this solution I replace nan with a number that is not in the data (-42). This is because nan is a difficult value for run-length encodings (RLEs): np.nan != np.nan, so no nans would be treated as consecutive.

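import numpy as np
from pyrle import Rle

data = ['1', '0', '9', '31', '11', '12', 'nan', '10', '44', '53', '12', '66', '99', '3', '2', '6.75833', 'nan', 'nan', 'nan', '3', '7', 'nan', 'nan']
# np.float was removed from NumPy; the built-in float does the same job
arr = np.array([float(f) for f in data])
assert -42 not in arr  # the sentinel must not already occur in the data

# Replace the NaNs before building the Rle, so that they form runs of -42
arr[np.isnan(arr)] = -42

r = Rle(arr)
is_nan = r.values == -42
np.max(r.runs[is_nan])
# 3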
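
If installing pyrle is not an option, the same run lengths can be obtained with plain NumPy by looking at where the NaN mask switches on and off. This is a sketch of that alternative rather than anything from the answers above; the variable names are chosen for illustration only.

```python
import numpy as np

data = ['1', '0', '9', '31', '11', '12', 'nan', '10', '44', '53', '12',
        '66', '99', '3', '2', '6.75833', 'nan', 'nan', 'nan', '3', '7',
        'nan', 'nan']
arr = np.array([float(s) for s in data])
mask = np.isnan(arr)

# Pad with False on both sides so that every run has a start and an end
# edge, including runs that touch the beginning or end of the array.
padded = np.concatenate(([False], mask, [False]))
edges = np.diff(padded.astype(np.int8))

starts = np.flatnonzero(edges == 1)   # index of the first NaN of each run
ends = np.flatnonzero(edges == -1)    # index one past the last NaN of each run
run_lengths = ends - starts

# Guard against data without any NaN, where max() of an empty array fails.
longest = int(run_lengths.max()) if run_lengths.size else 0

print(run_lengths.tolist())  # [1, 3, 2]
print(longest)               # 3
```

No sentinel value is needed here, because np.isnan is applied before any comparison between elements takes place.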
Licensed under: CC-BY-SA with attribution