Problem

I was working through a Kaggle tutorial on the Titanic disaster, and I'm getting different results depending on how I call cross_validation.cross_val_score.

If I call it like:

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

print(scores.mean())

0.801346801347

I get a different set of scores than if I call it like:

kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

print(scores.mean())

0.785634118967

These numbers are close, but different enough to be significant. As far as I understand, both code snippets are asking for a 3-fold cross validation strategy. Can anyone explain what is going on under the hood of the second example which is leading to the slightly lower score?


Solution

From the sklearn docs for cross_val_score's cv argument:

"For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used."

I believe that in the first case, StratifiedKFold is being used as the default. In the second case, you are explicitly passing a KFold generator.
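A minimal sketch of this behaviour, using the modern sklearn.model_selection API (the cross_validation module from the question was removed in later sklearn releases) and a synthetic dataset standing in for the Titanic frame:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Synthetic binary-classification data as a stand-in for the Titanic frame.
X, y = make_classification(n_samples=300, random_state=0)
alg = LogisticRegression(max_iter=1000)

# cv=3 with a classifier and binary y -> StratifiedKFold under the hood.
int_scores = cross_val_score(alg, X, y, cv=3)

# Passing StratifiedKFold explicitly reproduces the cv=3 scores exactly...
strat_scores = cross_val_score(alg, X, y, cv=StratifiedKFold(n_splits=3))

# ...while a plain KFold builds different folds, so its scores can differ.
kf_scores = cross_val_score(alg, X, y, cv=KFold(n_splits=3))

print((int_scores == strat_scores).all())
```

So the integer cv and an explicit StratifiedKFold are interchangeable here; it is the explicit plain KFold that changes the folds.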

The difference between the two is also documented in the docs.

"KFold divides all the samples in k groups of samples, called folds (if k = n, this is equivalent to the Leave One Out strategy), of equal sizes (if possible)."

"StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set."

This difference in folds is what is causing the difference in scores.
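To see how fold composition differs, here is a small illustration with deliberately sorted, imbalanced labels (a toy setup, not the Titanic data):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 12 zeros followed by 6 ones, mimicking a sorted column.
y = np.array([0] * 12 + [1] * 6)
X = np.zeros((18, 1))

# Fraction of positive samples landing in each test fold.
kf_ratios = [y[test].mean() for _, test in KFold(n_splits=3).split(X)]
skf_ratios = [y[test].mean() for _, test in StratifiedKFold(n_splits=3).split(X, y)]

print(kf_ratios)   # plain KFold: consecutive blocks, so ratios swing wildly
print(skf_ratios)  # stratified: every fold matches the overall 1/3 positive rate
```

With plain KFold the last fold contains only positives, while StratifiedKFold keeps every fold at the dataset's overall class ratio; an estimator trained and scored on such different folds will naturally report different scores.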

As a side note, I noticed that you are passing a random_state argument to the KFold object. However, this seed is only used if you also set KFold's shuffle parameter to True; shuffle defaults to False, so your random_state currently has no effect.
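A quick sketch of that interaction with the modern KFold API (note that recent sklearn versions go further and raise an error if random_state is set while shuffle is left at False):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(30).reshape(15, 2)

# Without shuffling, folds are consecutive blocks -- fully deterministic,
# so no seed is involved at all.
plain = [test for _, test in KFold(n_splits=3).split(X)]

# With shuffle=True, a fixed random_state makes the shuffled folds reproducible.
a = [test for _, test in KFold(n_splits=3, shuffle=True, random_state=1).split(X)]
b = [test for _, test in KFold(n_splits=3, shuffle=True, random_state=1).split(X)]

print(plain[0])
print(all((x == y).all() for x, y in zip(a, b)))
```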

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange