Question

I'm using sklearn/pandas/numpy.

I have a labeled data set where the outcome is either True or False, but with a much higher proportion of True entries. When running classifiers with k-fold (k=5) cross validation, this imbalance appears to bias them towards just predicting True.
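A quick way to see how misleading plain accuracy is here (a sketch; csv is the DataFrame from the snippet below, and X stands in for whatever feature matrix is in use): score a baseline that always predicts the majority class.

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# how skewed are the labels?
print(csv['result'].value_counts(normalize=True))

# mean accuracy of always predicting True; any real model has to beat this
baseline = DummyClassifier(strategy='most_frequent')
print(cross_val_score(baseline, X, csv['result'], cv=5).mean())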

Using weights, I was able to draw a sample with a class ratio much closer to 1:1, like so (csv is a pandas DataFrame read from a CSV file):

import numpy as np

results = csv[['result']]
# down-weight True rows (0.25 vs 1) so the weighted draw lands nearer 1:1
weights = np.where(results.to_numpy() == True, 0.25, 1).ravel()
csv_sample = csv.sample(n=60000, weights=weights)

And the results are much more promising! However, I'm wondering if there's a way for me to do cross validation where the TRAINING folds are rebalanced in this manner, but the TEST folds keep the actual proportions of the data.
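In other words, roughly something like this sketch, where feature_cols and the LogisticRegression are just placeholders for my real features and classifier: each training fold is resampled toward 1:1 with the same weights trick, while the held-out fold is scored untouched.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

feature_cols = [c for c in csv.columns if c != 'result']  # placeholder
X, y = csv[feature_cols], csv['result'].to_numpy().ravel()

clf = LogisticRegression()  # stand-in for the classifier under test
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True).split(X, y):
    train = csv.iloc[train_idx]
    # rebalance ONLY the training fold
    w = np.where(train['result'].to_numpy() == True, 0.25, 1).ravel()
    train_bal = train.sample(n=min(60000, len(train)), weights=w)
    clf.fit(train_bal[feature_cols], train_bal['result'])
    # the test fold keeps the natural class proportions
    scores.append(clf.score(X.iloc[test_idx], y[test_idx]))
print(np.mean(scores))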

Solution

Try the estimator option class_weight='balanced' (older scikit-learn versions called the same thing 'auto'). It worked really well for me with SGDClassifier in a similar situation.
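For instance (a minimal sketch, assuming X and y already hold your features and labels): 'balanced' reweights each class inversely to its frequency, so the rarer False class counts as much as the dominant True class, and cross_val_score still tests on folds at the natural proportions.

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# 'balanced' sets per-class weights to n_samples / (n_classes * class count)
clf = SGDClassifier(class_weight='balanced')
print(cross_val_score(clf, X, y, cv=5).mean())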

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange