Which comes first? Multiple Imputation, Splitting into train/test, or Standardization/Normalization
01-11-2019
Question
I am working on a multi-class classification problem with ~65 features and ~150K instances. About 30% of the features are categorical and the rest are numerical (continuous). I understand that standardization or normalization should be done after splitting the data into train and test subsets, but I am still not sure about the imputation process. For the classification task, I am planning to use Random Forest, Logistic Regression, and XGBoost (which are not distance-based).
Could someone please explain which should come first: split > imputation, or imputation > split? If split > imputation is correct, should I then follow imputation > standardization or standardization > imputation?
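To make the orderings concrete, here is a minimal sketch of one of them, split > imputation > standardization, in which every statistic is fitted on the training subset only and then reused on the test subset. That mirrors the reasoning already given above for standardization. It rests on several assumptions: simple mean imputation stands in for multiple imputation, only numerical features are handled, the toy data is made up, and all function names are hypothetical helpers written for this illustration.

```python
import random
import statistics


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_imputer(train_rows):
    """Per-column mean of the observed (non-missing) training values."""
    return [
        statistics.fmean(v for v in col if v is not None)
        for col in zip(*train_rows)
    ]


def impute(rows, fill_values):
    """Replace missing values (None) with the fitted fill values."""
    return [
        [fill if v is None else v for v, fill in zip(row, fill_values)]
        for row in rows
    ]


def fit_scaler(imputed_train_rows):
    """Per-column (mean, std) of the already-imputed training data."""
    params = []
    for col in zip(*imputed_train_rows):
        std = statistics.pstdev(col) or 1.0  # guard against zero variance
        params.append((statistics.fmean(col), std))
    return params


def standardize(rows, params):
    """Apply the fitted (mean, std) to each column."""
    return [
        [(v - mean) / std for v, (mean, std) in zip(row, params)]
        for row in rows
    ]


# Toy numerical data with missing entries marked as None (illustrative only).
rows = [
    [1.0, 10.0], [2.0, None], [3.0, 30.0], [None, 40.0],
    [5.0, 50.0], [6.0, 60.0], [7.0, 70.0], [8.0, 80.0],
]

# 1) Split first, so the test rows play no part in any fitted statistic.
train, test = split_train_test(rows)

# 2) Fit the imputer on train only, then apply it to both subsets.
fill_values = fit_imputer(train)
train_imputed = impute(train, fill_values)
test_imputed = impute(test, fill_values)

# 3) Fit the scaler on the imputed train data, then apply it to both subsets.
scaler_params = fit_scaler(train_imputed)
train_ready = standardize(train_imputed, scaler_params)
test_ready = standardize(test_imputed, scaler_params)
```

The test subset is transformed only with values learned from the training subset, so the standardized test columns will generally not have exactly zero mean or unit variance. With a multiple-imputation method the same fit-on-train, apply-to-both pattern would apply, but that is not shown here.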
No correct solution
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange