How to impute missing values in a NumPy array created by train_test_split from a pandas.DataFrame?

https://datascience.stackexchange.com/questions/927

16-10-2019

Question

I'm working with sklearn and pandas.DataFrame on a dataset that has lots of NA values. I implemented different imputation strategies for different columns of the DataFrame, based on column names. For example, I impute NAs in the predictor 'var1' with 0s and in 'var2' with the mean.

When I try to cross-validate my model using train_test_split, it returns a NumPy array, which does not have column names. How can I impute missing values in this array?

P.S. I deliberately do not impute missing values in the original data set before splitting, so that the test and validation sets are kept separate.

Solution

Can you just cast the array returned by train_test_split back into a pandas DataFrame, so you can carry out the same strategy? This is very similar to what I do when working with pandas and scikit-learn. For example,

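 import pandas as pd
 from sklearn.model_selection import train_test_split
 train_arr, test_arr = train_test_split(df.values)  # df is your original DataFrame; the split returns plain arrays
 new_df = pd.DataFrame(train_arr, columns=df.columns)  # reattach the original column names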
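
To make the round trip concrete, here is a minimal end-to-end sketch. The toy data, the split parameters (test_size, random_state) and the names train_arr, test_arr, train_df and test_df are assumptions made up for illustration; only 'var1' and 'var2' and their strategies come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumed); 'var1' and 'var2' mirror the column names in the question.
df = pd.DataFrame({
    "var1": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
    "var2": [10.0, 20.0, np.nan, 40.0, 50.0, np.nan],
})

# Passing a plain array (df.values) reproduces the situation in the question:
# the split returns NumPy arrays with no column names.
train_arr, test_arr = train_test_split(df.values, test_size=0.33, random_state=0)

# Rebuild DataFrames, reattaching the original column names.
train_df = pd.DataFrame(train_arr, columns=df.columns)
test_df = pd.DataFrame(test_arr, columns=df.columns)

# Column-specific strategies, as described in the question.
train_df["var1"] = train_df["var1"].fillna(0)
train_df["var2"] = train_df["var2"].fillna(train_df["var2"].mean())
```

The same two fillna lines can then be applied to test_df; whether to reuse the training mean there or recompute it is a modelling decision that this sketch leaves open.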

OTHER TIPS

From the link you mentioned in the comment, the train and test sets should be in the form of a dataframe if you followed the first explanation.

In that case, you could do something like this:

df[variable] = df[variable].fillna(df[variable].median())

You have several options for what to fill the N/A values with; see the pandas documentation on missing data: http://pandas.pydata.org/pandas-docs/stable/missing_data.html
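
As one illustration of those options, fillna also accepts a dict that maps column names to fill values, so several per-column strategies can be applied in one call. The sketch below assumes the frames are already split and named train_df and test_df (hypothetical names and toy values), and it computes the median from the training frame only, in the spirit of the question's P.S.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames, already split and carrying column names.
train_df = pd.DataFrame({"var1": [1.0, np.nan, 3.0], "var2": [10.0, np.nan, 30.0]})
test_df = pd.DataFrame({"var1": [np.nan, 5.0], "var2": [np.nan, 70.0]})

# One fill value per column, computed from the training data only.
fill_values = {
    "var1": 0,                          # constant fill
    "var2": train_df["var2"].median(),  # median of the training column
}

# fillna with a dict fills each listed column with its own value.
train_filled = train_df.fillna(fill_values)
test_filled = test_df.fillna(fill_values)
```

Reusing the training-set values on the test frame keeps information from the test rows out of the imputation step.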

If you followed the second explanation, using sklearn's cross-validation, you could follow mike1886's suggestion: convert the arrays into DataFrames and then use fillna.

Licensed under: CC-BY-SA with attribution
