Question

I have a dataset with 5K records and 60 features for a binary classification problem. The class proportion is 33:67.

Currently I am trying to improve the performance of my model, which is stuck at an F1-score of 89% for the majority class, 75% for the minority class, and an AUC of 80%.

I tried GridSearchCV and feature engineering. Though I don't explicitly extract the best parameters from the grid search below, I assume that the fitted object uses the best parameters when predicting. But nothing seems to help.

Does this mean my data has issues? By issues I don't mean missing values; I mean the way the data was extracted. Could it be data entry issues?

This is what I tried for GridSearchCV. Am I doing it right?

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

parameters_xgb = {
    'learning_rate': (0.1, 0.01, 0.05, 0.5, 0.3, 1),
    'n_estimators': (100, 200, 500, 1000),
    'max_depth': (5, 10, 20),
}

xg_clf = xgb.XGBClassifier()

# 5-fold cross-validated grid search; refit=True (the default) retrains on the best parameters
xgb_clf_gv = GridSearchCV(xg_clf, parameters_xgb, cv=5)
xgb_clf_op = xgb_clf_gv.fit(X_train_std, y_train)

# Evaluate on the held-out test set
y_pred = xgb_clf_op.predict(X_test_std)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Accuracy is ", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
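For reference, GridSearchCV refits on the whole training set with the best parameters by default (refit=True), so they can be inspected directly after fitting; a minimal sketch against the xgb_clf_gv object above:

# The search object exposes the winning configuration and its mean CV score
print("Best parameters:", xgb_clf_gv.best_params_)
print("Best CV score:", xgb_clf_gv.best_score_)  # default scoring for classifiers is accuracy

# To tune for AUC or minority-class F1 instead, pass a scoring argument, e.g.:
# GridSearchCV(xg_clf, parameters_xgb, cv=5, scoring='roc_auc')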

I also tried CatBoost and gradient boosting. The AUC stays around 80-82% on the test data.
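For completeness, the test-set AUC above comes from predicted probabilities rather than hard labels; a short sketch, reusing the fitted xgb_clf_gv from the code above:

from sklearn.metrics import roc_auc_score

# ROC AUC is computed from the positive-class probability, not the 0/1 predictions
y_proba = xgb_clf_gv.predict_proba(X_test_std)[:, 1]
print("Test ROC AUC:", roc_auc_score(y_test, y_proba))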


Solution

[Figure: performance increase with hyperparameter optimization]

I would not necessarily call these data issues. There is always some ceiling you just cannot surpass, depending on the dataset of course. Generally, feature engineering and understanding the data will yield much greater gains than hyperparameter optimization alone, which, as you can see from the picture, often yields only marginal increases (in one case it is even worse than the default parameters).
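As a rough illustration of that gap, one can cross-validate a default model against the tuned one with the same folds and metric; a sketch, assuming the X_train_std, y_train and xgb_clf_gv objects from the question:

import xgboost as xgb
from sklearn.model_selection import cross_val_score

# Same folds (cv=5, no shuffling) and metric for both, so the comparison is fair
default_auc = cross_val_score(xgb.XGBClassifier(), X_train_std, y_train,
                              cv=5, scoring='roc_auc').mean()
tuned_auc = cross_val_score(xgb.XGBClassifier(**xgb_clf_gv.best_params_),
                            X_train_std, y_train, cv=5, scoring='roc_auc').mean()
print(f"Default CV AUC: {default_auc:.3f}   Tuned CV AUC: {tuned_auc:.3f}")

The difference is typically a couple of points at most, which is why the emphasis here is on feature work rather than more tuning.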

Other tips

Why is 80% bad? Is there precedent to suggest you should expect higher? I have ideas at 60% that return millions of dollars. It may not be a perfect model, but one should not always expect perfect separation.

Licensed under: CC-BY-SA with attribution