Question

I am working on a multi-class classification problem, with ~65 features and ~150K instances. 30% of features are categorical and the rest are numerical (continuous). I understand that standardization or normalization should be done after splitting the data into train and test subsets, but I am not still sure about the imputation process. For the classification task, I am planning to use Random Forest, Logistic Regression, and XGBOOST (which are not distance-based).

Could someone please explain which should come first? Split > imputation or imputation>split? In case that split>imputation is correct, should I follow imputation>standardization or standardization>imputation?

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange
scroll top