I've been trying to figure out scikit's Random Forest sample_weight use and I cannot explain some of the results I'm seeing. Fundamentally I need it to balance a classification problem with unbalanced classes.
In particular, I was expecting that if I used a sample_weights array of all 1's I would get the same result as w sample_weights=None
. Additionally, I was expeting that any array of equal weights (i.e. all 1s, or all 10s or all 0.8s...) would provide the same result. Perhaps my intuition of weights is wrong in this case.
Here's the code:
import numpy as np
from sklearn import ensemble,metrics, cross_validation, datasets
#create a synthetic dataset with unbalanced classes
X,y = datasets.make_classification(
n_samples=10000,
n_features=20,
n_informative=4,
n_redundant=2,
n_repeated=0,
n_classes=2,
n_clusters_per_class=2,
weights=[0.9],
flip_y=0.01,
class_sep=1.0,
hypercube=True,
shift=0.0,
scale=1.0,
shuffle=True,
random_state=0)
model = ensemble.RandomForestClassifier()
w0=1 #weight associated to 0's
w1=1 #weight associated to 1's
#I should split train and validation but for the sake of understanding sample_weights I'll skip this step
model.fit(X, y,sample_weight=np.array([w0 if r==0 else w1 for r in y]))
preds = model.predict(X)
probas = model.predict_proba(X)
ACC = metrics.accuracy_score(y,preds)
precision, recall, thresholds = metrics.precision_recall_curve(y, probas[:, 1])
fpr, tpr, thresholds = metrics.roc_curve(y, probas[:, 1])
ROC = metrics.auc(fpr, tpr)
cm = metrics.confusion_matrix(y,preds)
print "ACCURACY:", ACC
print "ROC:", ROC
print "F1 Score:", metrics.f1_score(y,preds)
print "TP:", cm[1,1], cm[1,1]/(cm.sum()+0.0)
print "FP:", cm[0,1], cm[0,1]/(cm.sum()+0.0)
print "Precision:", cm[1,1]/(cm[1,1]+cm[0,1]*1.1)
print "Recall:", cm[1,1]/(cm[1,1]+cm[1,0]*1.1)
- With
w0=w1=1
I get, for instance, F1=0.9456
.
- With
w0=w1=10
I get, for instance, F1=0.9569
.
- With
sample_weights=None
I get F1=0.9474
.