Problem

I'm using sklearn/pandas/numpy.

I have a labeled data set where the possible outcomes are either True or False. However, the data set has a much higher proportion of True entries. When running classifiers with k-fold (n=5) cross-validation, this appears to bias the classifier towards just predicting True.

Using weights, I was able to adjust the sample data set I'm using to have a proportion closer to 1:1, like so (where csv is a pandas DataFrame loaded from a CSV file):

results = csv[['result']]
# .as_matrix() is deprecated; .to_numpy() is the current equivalent.
# Down-weight the majority (True) class so sampling evens the proportions.
weights = np.where(results.to_numpy() == True, 0.25, 1).ravel()
csv_sample = csv.sample(n=60000, weights=weights)

And the results are much more promising! However, I'm wondering if there's a way for me to do cross validation where the TRAINING set is adjusted in this manner, but the TEST set is closer to the actual proportion of data.
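The setup I have in mind could be sketched roughly like this (the data, column names, and classifier below are placeholders, not my real data): split with StratifiedKFold, down-sample only the training fold with the same weighting trick, and evaluate on the untouched test fold.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical toy data standing in for the real csv: ~80% True labels.
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=1000), "b": rng.normal(size=1000)})
y = pd.Series(rng.random(1000) < 0.8)

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    train = pd.DataFrame({"y": y.iloc[train_idx]})
    # Down-weight the majority (True) class when sampling the training fold;
    # the test fold keeps the data's natural class proportions.
    w = np.where(train["y"].to_numpy(), 0.25, 1.0)
    sampled = train.sample(n=len(train) // 2, weights=w, random_state=0)
    clf = LogisticRegression().fit(X.loc[sampled.index], y.loc[sampled.index])
    scores.append(accuracy_score(y.iloc[test_idx], clf.predict(X.iloc[test_idx])))
```

Is there a built-in way to get this behavior, rather than hand-rolling the loop?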


Solution

Try setting the estimator option class_weight='balanced' (older scikit-learn versions called this 'auto'). It worked really well for me with SGDClassifier in a similar situation.
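A minimal sketch of what I mean, using synthetic imbalanced data (the data and scoring choice are illustrative, not from your setup); class_weight='balanced' reweights samples inversely to class frequency, so no manual resampling is needed and the test folds keep their natural proportions:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced labels: roughly 80% True.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.random(1000) < 0.8

# class_weight='balanced' scales each sample's weight by
# n_samples / (n_classes * class_count), countering the imbalance.
clf = SGDClassifier(class_weight="balanced", random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
```

Cross-validation then trains and tests on the unmodified data; the rebalancing happens inside the loss rather than in the sampling.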

License: CC-BY-SA with attribution. Not affiliated with datascience.stackexchange.