Question

I am training an XGBoost model for binary classification on around 60 sparse numeric features. After training, the feature importance distribution has one feature with importance > 0.6, and all the rest with importance < 0.05.

I removed the most important feature and retrained. The same distribution formed: the new most important feature had importance > 0.6, and the rest had < 0.05. I kept removing the most important feature and retraining, over and over. My F1 score gradually dropped, but every time there was one feature far more important than the rest.

Also worth noting, when I removed the most important feature and retrained, the new most important feature was not the second most important feature from the previous training.

I cannot explain this behaviour intuitively. Does anyone know why this pattern arises?

Solution

According to the docs, "gain" is the default importance type. It is computed from the loss reduction each split achieves, via a formula where intuition doesn't help much. The general consensus is that feature importance is a tricky concept; as long as your model performs well, you shouldn't worry too much about it. With the gain formula in particular, the first split can be by far the most important one, depending on gamma, so every other feature's importance ends up low almost by default.
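To make the gain formula concrete, here is a minimal sketch of the split-gain computation as given in the XGBoost docs, where `G`/`H` are sums of gradients and hessians in each child, `lam` is the L2 regularization term, and `gamma` is the per-leaf complexity penalty. The function name and the example numbers are my own illustration, not from the question:

```python
# Sketch of XGBoost's split gain, assuming the standard notation:
# G_L, G_R = gradient sums in the left/right child,
# H_L, H_R = hessian sums, lam = lambda (L2 reg), gamma = leaf penalty.
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    def score(G, H):
        # Structure score of a single leaf.
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# A split that cleanly separates gradients of opposite sign gains a lot ...
big = split_gain(G_L=-10.0, H_L=5.0, G_R=10.0, H_R=5.0)
# ... while one that barely separates them gains almost nothing.
small = split_gain(G_L=-1.0, H_L=5.0, G_R=-0.5, H_R=5.0)
print(big, small)
```

Note that `gamma` is subtracted from every split's gain equally, so a large early split can tower over everything that follows.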

If, on the other hand, you chose "weight", importance would simply be the number of times the feature is chosen to split on. That can mean a feature with many unique values gets chosen many times and still improves accuracy.

All of this depends heavily on the data and the parameters you are using; there is no general answer to this pattern.

Licensed under: CC-BY-SA with attribution