Problem

I am working on a multi-class classification problem with ~65 features and ~150K instances. About 30% of the features are categorical and the rest are numerical (continuous). I understand that standardization or normalization should be done after splitting the data into train and test subsets, but I am still not sure about the imputation process. For the classification task, I am planning to use Random Forest, Logistic Regression, and XGBoost (none of which are distance-based).

Could someone please explain which should come first: split > imputation, or imputation > split? If split > imputation is correct, should I then follow imputation > standardization or standardization > imputation?
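
To make the ordering in question concrete, here is a minimal sketch of the split > imputation > standardization variant, offered as an illustration rather than a definitive answer. It uses only the Python standard library; the toy data, the mean-imputation strategy, and the helper names (`fit_mean_imputer`, `impute`, `fit_standardizer`, `standardize`) are all assumptions made for this example and do not come from the question. The point it shows is that every statistic (fill value, mean, standard deviation) is learned from the training subset only and then reused unchanged on the test subset.

```python
import random
import statistics


def fit_mean_imputer(column):
    """Learn the fill value (mean of observed values) from one column."""
    observed = [v for v in column if v is not None]
    return statistics.fmean(observed)


def impute(column, fill_value):
    """Replace missing entries (None) with a previously learned fill value."""
    return [fill_value if v is None else v for v in column]


def fit_standardizer(column):
    """Learn mean and population standard deviation from a complete column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return mean, (std if std > 0 else 1.0)  # guard against constant columns


def standardize(column, mean, std):
    """Apply previously learned statistics to a column."""
    return [(v - mean) / std for v in column]


# Toy numeric feature with missing values; purely illustrative data.
random.seed(0)
values = [random.gauss(50.0, 10.0) for _ in range(200)]
for i in random.sample(range(200), 30):
    values[i] = None

# 1) Split first, so no test rows influence any fitted statistic.
indices = list(range(len(values)))
random.shuffle(indices)
cut = int(0.8 * len(indices))
train = [values[i] for i in indices[:cut]]
test = [values[i] for i in indices[cut:]]

# 2) Fit the imputer on the training subset only, then apply it to both.
fill_value = fit_mean_imputer(train)
train_imputed = impute(train, fill_value)
test_imputed = impute(test, fill_value)

# 3) Fit the scaler on the imputed training subset, then apply it to both.
mean, std = fit_standardizer(train_imputed)
train_scaled = standardize(train_imputed, mean, std)
test_scaled = standardize(test_imputed, mean, std)
```

In this sketch, imputation precedes standardization simply because the scaling statistics are computed on complete values. In scikit-learn, the same fit-on-train, apply-to-test pattern is usually expressed with a `Pipeline` that chains `SimpleImputer` and `StandardScaler`. Note also that Random Forest and XGBoost are insensitive to feature scaling, so of the three models listed, standardization matters mainly for Logistic Regression.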

No correct solution

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange