Train/Test size and bias

https://datascience.stackexchange.com/questions/81009

13-12-2020

Question

I'm running a classifier (logistic regression). The details of my dataset are as follows:

dataset size= 279 observations

(80/20 rule)

train size= 233
test size = 56

# of events in train = 31
# of events in test = 8

I think my classifier and its results may be affected by these unequal class proportions. Is there any way to avoid bias and improve accuracy? What do you personally think of data like this?

Solution

If you're referring to the fact that your dataset is small:

You should use k-fold cross-validation. This lets you evaluate your model on all 279 instances, because every observation lands in the held-out fold exactly once.
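As a rough sketch of what that looks like with scikit-learn: the data below is synthetic (generated with `make_classification` as a stand-in for the real 279 rows), and the choice of 5 folds and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the real data: 279 rows, roughly 14% positives.
X, y = make_classification(
    n_samples=279, n_features=5, weights=[0.86], random_state=0
)

model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # 5 folds is an assumption

# Each row is predicted by a model that never saw it during training.
oof_pred = cross_val_predict(model, X, y, cv=cv)

print(oof_pred.shape)  # one out-of-fold prediction per observation
```

The out-of-fold predictions can then be scored against `y` as a whole, so the estimate rests on every observation rather than on a single small test split.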

If you're referring to the class imbalance being 31:202 in train and 8:48 in test:

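Use AUROC and the precision-recall curve (PRC). Both summarize performance across all decision thresholds, so they do not depend on one arbitrary cutoff.
Also see the Matthews correlation coefficient (MCC).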
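A minimal sketch of computing these three metrics with scikit-learn follows. The dataset is synthetic, and average precision is used here as a single-number summary of the precision-recall curve, which is my choice rather than something the answer specifies. Note that MCC, unlike the other two, is computed from hard class labels and therefore does depend on the threshold (0.5 by default in `predict`).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    matthews_corrcoef,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (an assumption for illustration).
X, y = make_classification(
    n_samples=279, n_features=5, weights=[0.86], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the event class
labels = model.predict(X_test)              # hard labels at the default cutoff

auroc = roc_auc_score(y_test, scores)                    # threshold-free
avg_precision = average_precision_score(y_test, scores)  # summarizes the PR curve
mcc = matthews_corrcoef(y_test, labels)                  # uses hard labels

print(f"AUROC={auroc:.3f}  AP={avg_precision:.3f}  MCC={mcc:.3f}")
```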

Other tips

With data this imbalanced, where one class heavily outnumbers the other, I think recall is a better measure than accuracy. Recall is the fraction of actual instances of the relevant class that the model correctly identifies.
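To see why accuracy can mislead here, consider a small worked example using the test-set counts from the question (8 events, 48 non-events) and a hypothetical model that never predicts an event. The degenerate model is an assumption made purely for illustration.

```python
# Test-set class counts taken from the question: 8 events, 48 non-events.
y_true = [1] * 8 + [0] * 48

# Hypothetical degenerate model that never predicts an event.
y_pred = [0] * 56

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(y_true)

accuracy = correct / len(y_true)  # 48 / 56
recall = true_pos / actual_pos    # 0 / 8

print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
# prints: accuracy=0.857 recall=0.000
```

Accuracy looks respectable even though the model finds none of the events, while recall exposes the failure immediately.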

To complete @BenjiAlbert's answer: with an imbalanced dataset, it is also recommended to use stratified k-fold, which preserves the relative class frequencies in each fold. More details are in the sklearn user guide.
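A short sketch of how stratification spreads the rare class across folds, using scikit-learn's `StratifiedKFold`. The label vector is an assumption: it takes 279 rows with 39 events (31 + 8 from the question), and the feature matrix is a placeholder because only the labels matter for the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed labels: 39 events among 279 observations.
y = np.array([1] * 39 + [0] * 240)
X = np.zeros((len(y), 1))  # placeholder features; the split only uses y

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many events end up in the held-out part of each fold.
events_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]

print(events_per_fold)
```

Every fold receives a near-equal share of the events, whereas a plain shuffled split could by chance leave a fold with very few of them.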

License: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange