Sequence to carry out data analysis?

https://datascience.stackexchange.com/questions/64462

19-10-2020

Question

I have a dataset with 4700 records for a classification problem. The class proportions are 33% and 67%.

A few questions:

1) Does this proportion qualify the dataset as imbalanced?

2) Should I do cross-validation and then apply sampling (over-/undersampling or SMOTE), or should I first balance my sample with these sampling techniques and then do cross-validation?

3) Why is propensity score matching used only in healthcare-related studies and not much in other applications?

4) How is propensity score matching different from other ML classification algorithms?

Solution

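You should fit preprocessing steps (imputation, scaling, encoding, resampling) on the training set only. Transformers such as imputers, scalers, and encoders are then applied to both the training and test sets, while resampling changes only the training data, so the test set keeps its original class distribution. With cross-validation, this means splitting first and resampling inside each training fold, not balancing the whole dataset beforehand. Your dataset is imbalanced, and you may see some improvement from resampling techniques, but you should always confirm this with cross-validation.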
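To make the ordering concrete, here is a minimal sketch using only the Python standard library. It assumes plain random oversampling as the resampling step (SMOTE or undersampling would go in the same place), and the helper names `stratified_folds`, `oversample`, and `resampled_cv_splits` are hypothetical, made up for this illustration. The toy labels only mimic the 33/67 split from the question; they are not real data.

```python
import random
from collections import Counter


def stratified_folds(labels, k, rng):
    """Split indices into k folds that roughly preserve class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def oversample(indices, labels, rng):
    """Duplicate randomly chosen minority indices until classes are equal in size."""
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def resampled_cv_splits(labels, k=5, seed=0):
    """Yield (train, test) index lists; only the training part is resampled."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, rng)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        # Resampling happens after the split, so no test row leaks into training.
        yield oversample(train_idx, labels, rng), test_idx


# Toy labels mimicking the 33% / 67% class split (illustrative only).
labels = [1] * 33 + [0] * 67
splits = list(resampled_cv_splits(labels, k=5))
for train_idx, test_idx in splits:
    train_counts = Counter(labels[i] for i in train_idx)
    test_counts = Counter(labels[i] for i in test_idx)
    print("train:", dict(train_counts), "test:", dict(test_counts))
```

In each fold the training indices come out balanced while the test indices keep the original class ratio, so the fold score reflects the distribution the model would actually face. A model and a metric would be fitted and evaluated inside the loop; other fitted steps such as scalers or imputers belong in the same place, fitted on the training indices only.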

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange