Question

I'm working on imputing null values in the Titanic dataset. The 'Embarked' column has some. I do NOT want to just set them all to the most common value, 'S'. I want to impute 'Embarked' based on its correlation with the other columns.

I have tried applying this encoding function to the 'Embarked' column:

def embark(e):
    if e == 'S': return 1
    if e == 'Q': return 2
    if e == 'C': return 3
    return 4

This lets me inspect data.corr(), but I think it's trickier than that, since I'll get a different correlation for each choice of value assignment (right??). I also thought about using a four-dimensional one-hot vector (for S, Q, C, NaN), but I doubt that would work.

Is there a scikit-learn method that does this in some way? Any further insights on the matter?


Solution

I suggest trying scikit-learn's KNNImputer. It fills each missing value using the values of the k nearest neighbours, where distance is computed on the features that are available (non-null). Note that KNNImputer operates on numeric arrays, so you will need to encode categorical columns such as 'Embarked' before imputing.
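A minimal sketch of that workflow, using a toy frame in place of the Titanic data (column names and values are assumed for illustration):

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in for the Titanic data (column names assumed).
df = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1],
    "Fare":     [71.3, 7.9, 13.0, 8.1, 52.0],
    "Embarked": ["C", "S", None, "S", "C"],
})

# KNNImputer works on numbers, so map the categories to codes first,
# leaving the missing entries as NaN.
codes = {"S": 0.0, "Q": 1.0, "C": 2.0}
df["Embarked_code"] = df["Embarked"].map(codes)

# Impute the code column from the 2 nearest rows (by the other features).
imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(df[["Pclass", "Fare", "Embarked_code"]])

# Round back to the nearest category code and decode.
inv = {v: k for k, v in codes.items()}
df["Embarked_imputed"] = [inv[round(x)] for x in imputed[:, 2]]
print(df["Embarked_imputed"].tolist())  # ['C', 'S', 'S', 'S', 'C']
```

Rounding the imputed code back to a category is a simplification; with more categories a one-hot encoding per category (impute each indicator, then take the argmax) avoids imposing an artificial order.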

You can also create multiple imputed datasets with different imputation settings/values, model each one, and then compare or combine the results. This helps address some of the problems inherent in single imputation, such as

  • randomness in the imputed values
  • high sampling variability
  • underestimated standard errors

Other options:

  • RandomForest-based imputation
  • fancyimpute
  • missingpy

It is fine to impute the data in your test dataset also. Just be sure not to include the label or response in any of the imputation, as that value won't be available in a new dataset.

Also, any imputation method you use should be fitted on the train dataset and then applied to the test dataset. This prevents data or information leakage between the two datasets, and simulates model performance on any future datasets you use the model on.
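A minimal sketch of that fit-on-train, apply-to-test pattern, using SimpleImputer as a stand-in for whatever imputer you choose:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0], [np.nan], [6.0]])
X_train, X_test = X[:4], X[4:]

# Fit the imputer on the training split only...
imputer = SimpleImputer(strategy="mean").fit(X_train)

# ...then apply the *same* fitted statistics to both splits.
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)

# The test NaN is filled with the TRAIN mean (1+2+4)/3, not the test mean.
print(imputer.statistics_)
```

The same pattern works for KNNImputer or any other scikit-learn transformer: call fit (or fit_transform) on the train split, then only transform on the test split.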

One more thing: after imputing, compare the distributions of the train and test datasets; you want them to match as closely as possible.

References:

https://towardsdatascience.com/the-use-of-knn-for-missing-values-cf33d935c637

https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/discussion/80000

https://towardsdatascience.com/preprocessing-encode-and-knn-impute-all-categorical-features-fast-b05f50b4dfaa

https://statisticalhorizons.com/more-imputations

OTHER TIPS

For your specific case I would recommend the "grouped mode", i.e. the most frequent value within each group, because that is the kind of value you are interested in imputing (I did the same for this Kaggle challenge).
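A minimal sketch of grouped-mode imputation with pandas, grouping by 'Pclass' as an assumed example (any informative grouping column works):

```python
import pandas as pd

# Toy data: fill missing 'Embarked' values with the most frequent
# value within each 'Pclass' group (column names assumed).
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Embarked": ["C", "C", None, "S", "S", None],
})

df["Embarked"] = df.groupby("Pclass")["Embarked"].transform(
    lambda s: s.fillna(s.mode().iloc[0])
)
print(df["Embarked"].tolist())  # ['C', 'C', 'C', 'S', 'S', 'S']
```

Series.mode() ignores NaN by default, so each group's fill value is computed from its observed entries only.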

In more general terms, we have to understand the scale of each variable. We often speak of categorical data, but more precisely we have to distinguish between "nominal data" and "ordinal data".

Income brackets are ordinal: there is a clear numerical hierarchy. Other data, such as 'Embarked' here, are nominal: there is no order or numerical relation between the categories.

A numeric correlation is therefore meaningless for nominal data: any codes you assign are arbitrary, which is exactly why you observed a different correlation for each value assignment. It is more helpful to think in terms of grouped distributions.

What does that mean for imputation?

Finally, besides simply imputing the grouped mode (the most frequent value per group), you could try imputation algorithms such as MICE, which predict each missing value from the other variables. This works for nominal as well as ordinal categorical factors.
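scikit-learn offers an experimental IterativeImputer inspired by MICE, which models each column as a function of the others. A minimal sketch on toy numeric data (for a categorical target you would encode it first, as with KNNImputer):

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data where column 1 is roughly 2 x column 0, with one gap.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
])

# Each column with missing values is regressed on the others, iteratively.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imp = imputer.fit_transform(X)
print(X_imp[2, 1])  # close to 6, since col 1 is ~2 x col 0 here
```

Note that IterativeImputer returns a single completed dataset; full MICE draws multiple imputations so that the downstream analysis can account for imputation uncertainty.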

Please be aware that imputation is highly dependent on the reason a value is missing. Make sure the values are not MNAR (missing not at random) before attempting imputation at all.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange