Question

I'm working on an employee attrition predictive model using sklearn's GradientBoostingClassifier. I have 9,000 observations, which I split 50/50 into training and test sets. I have another set of 1,200 observations that I use for a final validation. All 10,200 observations were obtained in a similar fashion.

I carried out a grid search with 5-fold cross-validation to obtain a suitable set of hyperparameters. The results on my test set are good and very stable. However, there is a big drop-off in performance when I use my final validation data.
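
For reference, the search looked roughly like this (the parameter grid below is illustrative, not my exact one):

> from sklearn.ensemble import GradientBoostingClassifier
> from sklearn.model_selection import GridSearchCV

> # illustrative grid -- my real search covered more values
> param_grid = {
>     'n_estimators': [100, 300],
>     'max_depth': [3, 5],
>     'learning_rate': [0.1, 0.2],
> }

> # 5-fold CV on the training half (X_train/y_train come from the split shown further down)
> search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
> search.fit(X_train, y_train)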

Results for the test set

->  Precision: 0.836 / Recall: 0.629 / Accuracy: 0.874

Results for the final validation set

->  Precision: 0.149 / Recall: 0.725 / Accuracy: 0.484

At first I thought this could be caused by data leakage, but even after removing "suspicious" features, there is still a big drop-off when comparing the test results with the final validation results.
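
One more thing I want to rule out: if final_validation.csv simply has a much lower positive-class rate than train_test.csv, precision would collapse like this even without leakage. A quick check (assuming Target is encoded 0/1, as the code below implies):

> # compare positive-class prevalence across the two files
> print('train/test positive rate:', y.mean())
> print('final validation positive rate:', y_final.mean())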

Surely I'm doing something wrong, but I'm at a loss as to what exactly. Here are the relevant lines of code (nothing fancy):

> import pandas as pd
> from sklearn.ensemble import GradientBoostingClassifier
> from sklearn.model_selection import train_test_split
> from sklearn.metrics import precision_recall_fscore_support as score

> # features and labels for the train/test pool
> X = pd.read_csv('train_test.csv')
> y = X.pop('Target')

> # held-out final validation set
> X_final = pd.read_csv('final_validation.csv')
> y_final = X_final.pop('Target')

> # unseeded, unstratified 50/50 split
> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

> gb = GradientBoostingClassifier(n_estimators=300, max_depth=5, learning_rate=0.2)
> gb_model = gb.fit(X_train, y_train)

> # test set
> y_pred = gb_model.predict(X_test)
> precision, recall, fscore, support = score(y_test, y_pred, average='binary')

> # final validation set
> y_hat = gb_model.predict(X_final)
> precision, recall, fscore, support = score(y_final, y_hat, average='binary')
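
In case it helps, one more check I could run: compare the score distribution the model assigns to the test set with the one it assigns to the final set. A shift there would point at the data itself rather than the evaluation code (a sketch, reusing gb_model from above):

> import numpy as np

> # mean predicted probability of the positive class on each set
> p_test = gb_model.predict_proba(X_test)[:, 1]
> p_final = gb_model.predict_proba(X_final)[:, 1]
> print('mean score, test:', np.mean(p_test))
> print('mean score, final:', np.mean(p_final))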

Any thoughts?

No correct solution

Licensed under: CC-BY-SA with attribution