Missing values in scikits machine learning

https://stackoverflow.com/questions/9365982

28-10-2019

Question

Is it possible to have missing values in scikit-learn? How should they be represented? I couldn't find any documentation about that.

Solution

~~Missing values are simply not supported in scikit-learn. There has been discussion on the mailing list about this before, but no attempt to actually write code to handle them.~~

~~Whatever you do, don't use NaN to encode missing values, since many of the algorithms refuse to handle samples containing NaNs.~~

The above answer is outdated; the latest release of scikit-learn has a class Imputer that does simple, per-feature missing value imputation. You can feed it arrays containing NaNs to have those replaced by the mean, median or mode of the corresponding feature.
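As a rough illustration of per-feature imputation: in newer scikit-learn releases (0.20 onward, as far as I know) the equivalent class is `sklearn.impute.SimpleImputer`, and the old `Imputer` was later removed. The toy array below is made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: two features, one missing entry in each column.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 60.0],
])

# strategy may be "mean", "median", or "most_frequent" (the mode).
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # per-feature fill values learned from the data
print(X_filled)
```

Fitting on the training set and calling `transform` on the test set keeps the fill values from leaking test information.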

OTHER TIPS

I wish I could provide a simple example, but I have found that RandomForestRegressor does not handle NaNs gracefully. Performance gets steadily worse as features with increasing percentages of NaNs are added. Features that have "too many" NaNs are completely ignored, even when the NaNs carry very useful information.

This is because the algorithm will never create a split on the decision "isnan" or "ismissing". The algorithm will ignore a feature at a particular level of the tree if that feature has a single NaN in that subset of samples. But, at lower levels of the tree, when sample sizes are smaller, it becomes more likely that a subset of samples won't have a NaN in a particular feature's values, and a split can occur on that feature.

I have tried various imputation techniques to deal with the problem (replace with mean/median, predict missing values using a different model, etc.), but the results were mixed.

Instead, this is my solution: replace NaNs with a single, obviously out-of-range value (like -1.0). This enables the tree to split on the criterion "unknown value vs. known value". However, there is a strange side effect of using such out-of-range values: known values near the out-of-range value can get lumped together with it when the algorithm looks for a good place to split. For example, known 0s could get lumped with the -1s used to replace the NaNs. So your model could change depending on whether your out-of-range value is less than the minimum or greater than the maximum (it could get lumped in with the minimum or the maximum value, respectively). This may or may not help the technique generalize; the outcome depends on how similarly minimum- or maximum-value samples behave to NaN-value samples.
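A minimal sketch of the out-of-range replacement, assuming the feature lives in a pandas Series. The helper name `fill_out_of_range` and its `side` argument are made up for illustration, and the sentinel is derived from the observed range rather than fixed at -1.0, so both the "below the minimum" and "above the maximum" variants can be compared.

```python
import numpy as np
import pandas as pd


def fill_out_of_range(col: pd.Series, side: str = "below") -> pd.Series:
    """Replace NaNs with a sentinel just outside the observed range.

    `side` ("below" or "above") is a hypothetical knob for this sketch.
    """
    lo, hi = col.min(), col.max()   # min/max skip NaN by default
    margin = (hi - lo) or 1.0       # fall back to 1.0 for a constant column
    sentinel = lo - margin if side == "below" else hi + margin
    return col.fillna(sentinel)


x = pd.Series([0.0, 2.0, np.nan, 4.0, np.nan])
below = fill_out_of_range(x, "below")   # NaN -> -4.0
above = fill_out_of_range(x, "above")   # NaN -> 8.0
```

Training once with each variant and comparing validation scores is one way to check whether the missing samples behave more like the low end or the high end of the feature.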

Replacing a missing value with a mean/median/other statistic may not solve the problem, as the fact that the value is missing may itself be significant. For example, in a survey on physical characteristics, a respondent may not give their height if they are embarrassed about being abnormally tall or short. This would imply that missing values indicate the respondent was unusually tall or short: the opposite of the median value.

What is necessary is a model that has a separate rule for missing values; any attempt to guess the missing value will likely reduce the predictive power of the model.

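For example, flag the missing entries first, then fill them:

    import numpy as np  # df is assumed to be an existing pandas DataFrame
    # The indicator must be built before fillna, or the NaNs are already gone.
    df['xvariable_missing'] = np.where(df['xvariable'].isna(), 1, 0)
    df['xvariable'] = df['xvariable'].fillna(df['xvariable'].median())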
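The same flag-then-fill idea can probably also be expressed with scikit-learn's `SimpleImputer`, assuming a release that supports the `add_indicator` parameter (0.21 or later, as far as I know). The height values below are invented for the sketch.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature data (heights in cm) with two missing entries.
X = np.array([[150.0], [np.nan], [170.0], [190.0], [np.nan]])

# add_indicator=True appends one 0/1 column per feature that had missing
# values during fit, alongside the median-filled values.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out)  # column 0: filled heights, column 1: missing flag
```

With the flag column present, a downstream model can learn a separate rule for "height not reported" instead of treating those rows as median-height respondents.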

I came across a very similar issue when running RandomForestRegressor on my data. The presence of NA values was producing "nan" predictions. From reading several discussions, I gather that the documentation by Breiman recommends two solutions, for continuous and categorical data respectively:

Continuous data: calculate the median of the column (feature) and use this.
Categorical data: determine the most frequently occurring category and use this.

According to Breiman, the random nature of the algorithm and the number of trees allow for the correction without too much effect on the accuracy of the prediction. I feel this would be the case if NA values are sparse; a feature containing many NA values will, I think, most likely have an effect.
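The median/most-frequent rule described above could be sketched in pandas roughly as follows. The DataFrame and its column names are hypothetical, and telling continuous from categorical by dtype is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one continuous and one categorical column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "color": ["red", "blue", None, "red"],
})

filled = df.copy()
for name in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[name]):
        # Continuous: fill with the column median.
        filled[name] = filled[name].fillna(filled[name].median())
    else:
        # Categorical: fill with the most frequent category.
        # mode() can return ties; [0] takes the first one.
        filled[name] = filled[name].fillna(filled[name].mode()[0])
```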

Orange is another Python machine learning library that has facilities dedicated to imputation. I have not had a chance to use them, but might soon, since the simple methods of replacing NaNs with zeros, averages, or medians all have significant problems.

I have encountered this problem too. In a practical case, I found an R package called missForest that handles it well: it imputes the missing values and greatly improved my predictions.

Instead of simply replacing NAs with the median or mean, missForest replaces them with a prediction of what it thinks the missing value should be. It makes the predictions using a random forest trained on the observed values of the data matrix. It can run very slowly on a large data set that contains a high number of missing values, so there is a trade-off with this method.

A similar option in Python is predictive_imputer.
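I have not verified the predictive_imputer API, so here is a sketch of the same general idea (predicting missing entries from the other features with a random forest) using scikit-learn's experimental `IterativeImputer` instead. This is a swapped-in technique, not missForest itself, and the synthetic data and parameter values are assumptions chosen only to keep the example small.

```python
import numpy as np
# IterativeImputer is experimental; this import enables it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: the second column is roughly twice the first.
a = rng.uniform(0, 10, size=200)
X = np.column_stack([a, 2 * a + rng.normal(0, 0.1, size=200)])

# Knock out 20 entries of the second column.
X_missing = X.copy()
missing_rows = rng.choice(200, size=20, replace=False)
X_missing[missing_rows, 1] = np.nan

# Each feature with missing values is regressed on the other features.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_filled = imputer.fit_transform(X_missing)
```

As with missForest, the cost grows with the number of features and iterations, since a forest is refit for each feature on every pass.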

When you run into missing values in input features, the first order of business is not how to impute them. The most important question is WHY you should. Unless you have a clear and definitive idea of what the 'true' reality behind the data is, you may want to curb the urge to impute. This is not about technique or package in the first place.

Historically we resorted to tree methods like decision trees mainly because some of us, at least, felt that imputing missing values in order to fit a regression (linear regression, logistic regression, or even a neural network) is distortive enough that we should have methods that do not require imputing missing values 'among the columns'. This is the so-called informativeness of missingness, which should be a familiar concept to those familiar with, say, Bayesian methods.

If you are really modeling on big data, rather than just talking about it, chances are you face a large number of columns. In common feature-extraction practice such as text analytics, you may very well say that missing means count=0. That is fine because you know the root cause. The reality, especially with structured data sources, is that you don't know, or simply don't have time to find out, the root cause. But if your engine forces you to plug in a value, be it NaN or some other placeholder that the engine can tolerate, I may very well argue that your model is only as good as your imputation, which does not make sense.

One intriguing question is: if we leave missingness to be judged by its close context inside the splitting process (first- or second-degree surrogates), does foresting actually make the contextual judgement moot, because the context per se is a random selection? This, however, is a 'better' problem. At least it does not hurt that much. It certainly should make preserving missingness unnecessary.

As a practical matter, if you have a large number of input features, you probably cannot have a 'good' imputation strategy after all. From the sheer imputation perspective, the best practice is anything but univariate, which in the context of RF pretty much means using the RF to impute before modeling with it.

Therefore, unless somebody tells me (or us), "we are not able to do that", I think we should enable carrying forward missing 'cells', entirely bypassing the subject of how 'best' to impute.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow