Question

I'm running a classifier (logistic regression). The details of my dataset are as follows:

dataset size= 279 observations 

(80/20 rule)

train size= 233
test size = 56

# of events in train = 31
# of events in test = 8

I think my classifier and results may be affected by this unequal class proportion. Is there any way to avoid bias issues and improve accuracy? What do you personally think of such data?


Solution

If you're referring to the fact that your dataset is small:

If you're referring to the class imbalance being 31:202 in train and 8:48 in test:

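  • Evaluate with AUROC and the precision-recall curve (PRC); both summarize performance across all decision thresholds, so the result does not depend on one arbitrary cutoff
  • Also consider the Matthews correlation coefficient (MCC), which accounts for all four cells of the confusion matrix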
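
As a rough illustration of the metrics above, here is a minimal sketch using scikit-learn. The data is synthetic (generated with `make_classification` at roughly the same size and event rate as in the question), so the printed numbers say nothing about the real dataset; the feature count and random seeds are assumptions made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    matthews_corrcoef,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 14% positives, as in the question.
X, y = make_classification(
    n_samples=279, n_features=10, weights=[0.86], random_state=0
)
# stratify=y keeps the event rate similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Threshold-free metrics are computed from the predicted probability of the positive class.
scores = clf.predict_proba(X_test)[:, 1]
auroc = roc_auc_score(y_test, scores)
avg_precision = average_precision_score(y_test, scores)  # one-number summary of the PR curve

# MCC needs hard labels; predict() applies the default 0.5 cutoff.
mcc = matthews_corrcoef(y_test, clf.predict(X_test))

print(f"AUROC: {auroc:.3f}  AP: {avg_precision:.3f}  MCC: {mcc:.3f}")
```

With only a handful of events in the test split, all three numbers will be noisy, so it is probably worth looking at them over several resampled splits rather than a single one.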

Other tips

I think that for imbalanced data like this, where one class heavily outnumbers the other, recall is a better measure than accuracy. Recall gives the percentage of actual positive (relevant) cases that the model correctly identifies.
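
To make the point concrete, here is a small worked example using the test-split counts from the question (8 events, 48 non-events). The always-negative "model" is hypothetical and only there to show how accuracy can look good while recall is zero.

```python
# Test split from the question: 8 events (positives), 48 non-events.
y_true = [1] * 8 + [0] * 48

# Hypothetical degenerate model that always predicts "no event".
y_pred = [0] * 56

true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)            # 48 / 56
recall = true_pos / (true_pos + false_neg)  # 0 / 8

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
# prints: accuracy = 0.857, recall = 0.000
```

So a model that never detects a single event still scores about 86% accuracy on this split, which is why accuracy alone is misleading here.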

To complement @BenjiAlbert's answer: with an imbalanced dataset, it is also recommended to use stratified k-fold cross-validation, which preserves the relative class frequencies in each fold. You can find more details in the sklearn user guide.
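
A minimal sketch of what stratification does, using labels that match the training split in the question (31 events, 202 non-events). The feature matrix is a placeholder and the choice of 5 folds is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels matching the training split in the question: 31 events, 202 non-events.
y = np.array([1] * 31 + [0] * 202)
X = np.zeros((len(y), 1))  # placeholder features; only y drives the stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

events_per_fold = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Count how many events landed in this validation fold.
    n_events = int(y[val_idx].sum())
    events_per_fold.append(n_events)
    print(f"fold {fold}: {len(val_idx)} samples, {n_events} events")
```

Each validation fold should end up with 6 or 7 of the 31 events, whereas a plain unstratified split could by chance leave a fold with very few.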

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange