Question

How to determine feature importance when using XGBoost (XGBClassifier or XGBRegressor) in a pipeline?

AttributeError: 'Pipeline' object has no attribute 'get_fscore'

The answer provided here is similar, but I couldn't follow it.

Solution

As I found, there are a few ways to determine feature importance. First:

print(grid_search.best_estimator_.named_steps["clf"].feature_importances_)

result:

[ 0.14582562  0.08367272  0.06409663  0.07631433  0.08705109  0.03827286
  0.0592836   0.05025916  0.07076083  0.0699278   0.04993521  0.07756387
  0.05095335  0.07608293]
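
Since feature_importances_ is a bare array in input-column order, it can help to pair each score with its column name. A minimal sketch, assuming X is the pandas DataFrame the pipeline was fitted on:

importances = grid_search.best_estimator_.named_steps["clf"].feature_importances_
for name, score in zip(X.columns, importances):
    print(name, score)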

Second:

print(grid_search.best_estimator_.named_steps["clf"].booster().get_fscore())

result:

{'f2': 1385, 'f11': 1676, 'f12': 1101, 'f6': 1281, 'f9': 1511, 'f7': 1086, 'f5': 827, 'f0': 3151, 'f10': 1079, 'f1': 1808, 'f3': 1649, 'f13': 1644, 'f8': 1529, 'f4': 1881}
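
The keys f0, f1, ... follow the column order of the matrix the model was trained on, so they can be mapped back to readable names. A short sketch, again assuming X holds the original columns:

fscore = grid_search.best_estimator_.named_steps["clf"].booster().get_fscore()
print({X.columns[int(key[1:])]: count for key, count in fscore.items()})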

Third, in newer xgboost versions, where the booster() accessor has been renamed to get_booster():

print(grid_search.best_estimator_.named_steps["clf"].get_booster().get_fscore())

OTHER TIPS

Getting a reference to the xgboost object

You should first get the XGBClassifier or XGBRegressor element out of the pipeline. You can do this either by indexing into the pipeline's steps or by referencing the step by name.

from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

clf = XGBClassifier()
# other_element stands in for any preceding transformer in the pipeline
pipe = Pipeline([('other', other_element), ('xgboost', clf)])

To get the XGBClassifier you could (as sketched after this list):

  • use clf if you still have a reference to it
  • index the pipeline by name: pipe.named_steps['xgboost']
  • index the pipeline by position: pipe.steps[1][1] (pipe.steps[1] returns the ('xgboost', clf) tuple, so take its second element)
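
A minimal sketch of all three routes, assuming the pipe defined above has been fitted:

xgb_by_ref = clf                           # original reference
xgb_by_name = pipe.named_steps['xgboost']  # lookup by step name
xgb_by_pos = pipe.steps[1][1]              # steps[1] is the ('xgboost', clf) tuple
assert xgb_by_ref is xgb_by_name is xgb_by_pos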

Getting the importance

Secondly, in older versions of xgboost, importance was not implemented for the sklearn wrapper. See this github issue. A workaround for adding it to your XGBClassifier or XGBRegressor is also offered there; it boils down to adding the methods to the class yourself.
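
A minimal sketch of such a workaround, assuming an older xgboost whose wrapper still exposes booster() but lacks feature_importances_ (the helper name here is hypothetical):

def feature_importances(model):
    # Normalize the booster's per-feature split counts so they sum to 1,
    # mirroring what feature_importances_ reports in newer releases
    fscore = model.booster().get_fscore()
    total = sum(fscore.values())
    return {feat: count / total for feat, count in fscore.items()}

print(feature_importances(clf))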

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange