Question

I am running xgboost on a regression classification problem where the model predicts a score from 0 to 1 for how likely a gene is to cause a disease.

I try to avoid overfitting in all the ways I can think of, and the mean output of nested cross-validation is an r2 of 0.88. I'm not sure if I can trust this, or whether there are other ways I can check for overfitting. The r2 on a plain (non-nested) train/test split is: Train r2: 0.971, Test r2: 0.868.

So far I:

  • Remove features with a correlation >0.9 and remove any features with >50% missing data (this is hard to strengthen; a lot of genetic features simply have missing data for many understudied genes in biology)
  • Skip imputation to avoid imputation bias, since xgboost accepts missing data
  • Scale features with MinMaxScaler() in scikit-learn - recommended as a good starting point, and most features don't have a normal distribution
  • Compare 2 feature selection methods (one using features xgboost deems important from SHAP values and one using Boruta; both give 0.87-0.88 r2 on average across the 10 nested CV folds and only remove 3-4 out of 57 features)
  • Use nested k-fold cross-validation with 10 folds (sketched below)
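Roughly, it looks like this (X, y and the grid values are placeholders, not my exact configuration):

# Sketch of the nested CV setup (placeholder data and grid, not the exact configuration)
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from xgboost import XGBRegressor

X, y = np.random.rand(200, 57), np.random.rand(200)  # stand-ins for the real features/scores

inner_cv = KFold(n_splits=10, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=1)

param_grid = {'max_depth': [3, 5], 'n_estimators': [100, 500]}  # example grid only
search = GridSearchCV(XGBRegressor(objective='reg:squarederror'),
                      param_grid, scoring='r2', cv=inner_cv)

# Inner loop tunes hyperparameters, outer loop estimates generalisation
nested_r2 = cross_val_score(search, X, y, scoring='r2', cv=outer_cv)
print(nested_r2.mean())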

The only other area I'm aware of that I haven't really explored is projection techniques. My features are all numeric but a mix of continuous and discrete data types, and I am not sure which method would be best, e.g. UMAP, PCA or partial least squares.

Are there any other ways I can investigate overfitting? I have a biology background so any resources on this would be useful and any help appreciated.

I have also manually removed some minority example genes before training (e.g. removed training genes with a 0.9 score, which make up only about 1/8 of the training dataset), then had the trained model predict them to see how it generalises to these 'new', hard-to-predict genes - it gives them a 0.6-0.7 score when they are actually 0.9:

y_pred = [0.69412696, 0.709764, 0.6366122]

y_true = [0.9, 0.9, 0.9]

r2_score(y_true, y_pred)  # outputs 0.0

10-fold nested CV r2 results per fold:

'test_r2': array([0.8484691, 0.86808136, 0.91821645, 0.93616375, 0.94435934, 0.82065733, 0.84856025, 0.8267642, 0.84561417, 0.89567455])

Edit:

A few other things I've tried:

  • I think I've misused classification here (and removed the tag accordingly): I use regression models, and since I only have continuous scores rather than labels I don't get true positives, false positives, etc., so I can't do ROC. I'm not sure what other metrics are as good as or better than r2 for regression.

  • I have tried applying imputation to compare other models (random forest, SVM, and logistic regression with elastic net or lasso); all of them perform notably worse than gradient boosting (the best is random forest, with a 0.59 average nested r2). I was originally concerned about bias from imputed data - is imputation worth doing to counteract overfitting? (A sketch of this kind of comparison is below, after this list.)

  • I use GridSearchCV in scikit-learn for all my models with nested cross-validation; I should have included this information originally, as I have been trying to always do this.
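A rough sketch of how that model comparison can be wired up (SimpleImputer with the median strategy and placeholder data; not my exact pipeline):

# Sketch: comparing an imputation-based model against xgboost (placeholder data)
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = np.random.rand(200, 57), np.random.rand(200)
X[np.random.rand(*X.shape) < 0.2] = np.nan  # simulate missing values in the features

# Imputing inside the pipeline means the imputer is fit only on each training fold
rf_pipe = make_pipeline(SimpleImputer(strategy='median'),
                        RandomForestRegressor(random_state=0))
print(cross_val_score(rf_pipe, X, y, scoring='r2', cv=10).mean())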

I have a biology background, so I'm not sure about best practices for machine learning, but from this I'm suspecting random forest is the better choice, that I should tune its parameters more thoroughly than I currently do, and then trust that model's nested CV result. Is this the best approach?

I'm also not sure whether the way I tune my random forest is reasonable; currently I use:

rfr = RandomForestRegressor(random_state=seed)
rfr_params = {'n_estimators': [100, 500, 1000],
              'min_samples_split': [50, 100],
              'min_samples_leaf': [50, 100]}
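This grid is then passed to GridSearchCV roughly like this (a sketch; the scoring and cv settings are illustrative, and X, y are my feature matrix and scores):

# Sketch of how the grid above is used (illustrative settings)
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rfr_search = GridSearchCV(rfr, rfr_params, scoring='r2',
                          cv=KFold(n_splits=10, shuffle=True, random_state=seed))
# Wrapping the search in cross_val_score gives the outer loop of the nested CV
nested_r2 = cross_val_score(rfr_search, X, y, scoring='r2', cv=10)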

Solution

  1. The direct way to check your model for overfitting is to compare its performance on a training set with its performance on a testing set; overfitting is when your train score is significantly above your CV score.
    According to your comments, your r2 score is 0.97 on the training set and 0.86 on your testing set (or, similarly, a 0.88 mean CV score across 10 folds). That is somewhat overfit, but not extremely so; think about whether 0.88 is "good enough" for your requirements.

  2. The r2 score is 1 - (mean squared error of the predictions) / (variance of the true values). In the example you showed, all three true values were the same, so their variance is zero. The r2 score should have been negative infinity, but sklearn apparently corrects this to 0; you can verify that changing y_true to [0.9, 0.9, 0.90001] changes your r2 score to a very large negative number (around -2*10**9).
    This is why checking r2 against a small sample is not a good idea: the mean of the small sample carries too much of the information.
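    A quick way to verify this yourself, using the values from your question (a minimal sketch):

    # R^2 is undefined when the true values have zero variance
    from sklearn.metrics import r2_score

    y_pred = [0.69412696, 0.709764, 0.6366122]
    print(r2_score([0.9, 0.9, 0.9], y_pred))      # constant targets: sklearn reports 0.0
    print(r2_score([0.9, 0.9, 0.90001], y_pred))  # tiny variance: a huge negative r2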

  3. You added that you want to know which parameters to tune in order to prevent overfitting. In the edit to your question, you said you're using grid search over n_estimators (3 options), min_samples_split (2 options) and min_samples_leaf (2 options).
    There are other parameters you can try, and in my experience max_depth is important to tune.
    This question on Stack Overflow and this question on Cross Validated deal with overfitting, and there are good options there.
    I'd add that if you're trying many options, you may be better off using Bayesian optimization (there's a package that works well with sklearn: https://scikit-optimize.github.io/stable/auto_examples/sklearn-gridsearchcv-replacement.html).
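    A minimal sketch of what that looks like with skopt's BayesSearchCV (the search space here is only an example):

    # Bayesian optimisation over hyperparameters with scikit-optimize
    from skopt import BayesSearchCV
    from sklearn.ensemble import RandomForestRegressor

    opt = BayesSearchCV(
        RandomForestRegressor(random_state=0),
        {'n_estimators': (100, 1000),      # ranges are sampled rather than exhaustively gridded
         'max_depth': (2, 20),
         'min_samples_leaf': (1, 100)},
        n_iter=32, scoring='r2', cv=5)
    # opt.fit(X, y); opt.best_params_ then behaves like GridSearchCV's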

OTHER TIPS

Overfitting can be identified by checking validation metrics such as accuracy and loss. The validation metrics usually improve up to a point and then stagnate or start declining once the model begins to overfit.

If our model does much better on the training set than on the test set, then we’re likely overfitting.

You can use Occam's razor test: If two models have comparable performance, then you should usually pick the simpler one.

For linear regression, there is an excellent accelerated cross-validation method called predicted R-squared. This method doesn’t require you to collect a separate sample or partition your data, and you can obtain the cross-validated results as you fit the model. Statistical software calculates predicted R-squared using the following automated procedure:

  • It removes a data point from the dataset.
  • Calculates the regression equation.
  • Evaluates how well the model predicts the missing observation.
  • And, repeats this for all data points in the dataset.

Predicted R-squared has several cool features. First, you can just include it in the output as you fit the model without any extra steps on your part. Second, it’s easy to interpret. You simply compare predicted R-squared to the regular R-squared and see if there is a big difference.

If there is a large discrepancy between the two values, your model doesn’t predict new observations as well as it fits the original dataset. The results are not generalizable, and there’s a good chance you’re overfitting the model.
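Predicted R-squared is essentially leave-one-out cross-validation; a minimal sketch of computing it by hand for a linear model (placeholder data):

# Sketch: predicted R-squared (PRESS-based) via leave-one-out cross-validation
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

X, y = np.random.rand(100, 5), np.random.rand(100)  # placeholder data

loo_pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
press = np.sum((y - loo_pred) ** 2)                  # prediction error sum of squares
predicted_r2 = 1 - press / np.sum((y - y.mean()) ** 2)
# Compare predicted_r2 with the ordinary R-squared of the model fit on all the data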


- Use RandomForest, as XGBoost is more prone to overfitting and its hyperparameters are comparatively difficult to tune.
Tune at least these params (note that 'gini'/'entropy' are classification criteria; for a regressor use e.g. 'squared_error' or 'absolute_error'):
param_grid = {'n_estimators': [ ], 'max_features': [ ], 'max_depth': [ ], 'criterion': ['squared_error', 'absolute_error']}

- Try imputation based on your domain knowledge and using other features, e.g. correlation

- Scaling is not really needed for tree models

- Monitor other metrics along with the $R^2$ score. Being in the domain, you must know how much error is "too much". $R^2$ rewards useless features, so be mindful of that; you may use adjusted $R^2$.
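A small sketch of adjusted $R^2$, which penalises models for adding uninformative features (n = number of samples, p = number of features; the numbers are only illustrative):

def adjusted_r2(r2, n, p):
    """r2: ordinary R^2, n: number of samples, p: number of features."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.88, n=500, p=57))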

- Use K=10 only when you have sufficient samples; otherwise try K=5 or K=3. If we use K=10 on a small dataset, the cross-validation test sets will be very small and we may see very high variance across the 10 predictions. I suspect the same in your result, where the output ranges from 0.82 to 0.94:
array([0.8484691, 0.86808136, 0.91821645, 0.93616375, 0.94435934, 0.82065733, 0.84856025, 0.8267642, 0.84561417, 0.89567455])

- Feature selection/engineering - a broad topic in itself. I would only suggest trying one thing at a time and keeping proper track of which changes produced which results. From the question it seems you are trying many things at once.

When evaluating xgboost (or any overfitting-prone model), I would plot a validation curve. A validation curve shows the evaluation metric (R2 in your case) on the training set and the validation set for each new estimator you add. You will usually see both training and validation R2 increase early on; if training R2 is still increasing while validation R2 starts to decrease, you know overfitting is a problem.
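A minimal sketch of such a curve with scikit-learn's validation_curve, varying the number of boosting rounds (placeholder data):

# Validation curve over n_estimators for an xgboost regressor (placeholder data)
import numpy as np
from sklearn.model_selection import validation_curve
from xgboost import XGBRegressor

X, y = np.random.rand(300, 57), np.random.rand(300)
param_range = [50, 100, 200, 400, 800]

train_scores, val_scores = validation_curve(
    XGBRegressor(objective='reg:squarederror'), X, y,
    param_name='n_estimators', param_range=param_range,
    scoring='r2', cv=5)

# Overfitting shows where training R2 keeps rising while validation R2 flattens or drops
for n, tr, va in zip(param_range, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(n, round(tr, 3), round(va, 3))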

Be careful about overfitting the validation set. If your dataset is not very large and you run a lot of experiments, it is possible to overfit the evaluation set. Therefore the data is often split into three sets: training, validation, and test, where you only evaluate models that already look good on the validation set against the test set. That way you don't run many experiments against the test set and don't overfit to it.
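A common way to get those three sets is two calls to train_test_split (a sketch; the 60/20/20 split and names are just an example):

# Sketch: train / validation / test split via two calls to train_test_split
from sklearn.model_selection import train_test_split

# X, y assumed to be the full feature matrix and target scores
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
# Tune and compare models on (X_val, y_val); touch (X_test, y_test) only for the final model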

  1. You should be using an evaluation metric like area under the ROC curve, not R^2. R^2 is good for continuous, unbounded variables, not classification; this is the most important thing you should do. If your outcome variable is highly imbalanced, you might want to use precision-recall. More about Precision-Recall and ROC.
  2. You need to do parameter tuning with Grid Search.
  3. It might be better to use random forest, since boosting methods can sometimes overfit. You should also try logistic regression.
  4. I would avoid removing variables before training based on correlation.

I am happy to help further if you update your question to include correct metrics for classification problems.

Licensed under: CC-BY-SA with attribution