Question

I have run into a problem using SHAP values to interpret a tree-based model.
(https://github.com/slundberg/shapsd)
I have around 30 input features, and two of them are strongly positively correlated with each other.
After training an XGBoost model (Python), I looked at the SHAP values of those two features and found that the SHAP values are negatively correlated.

Could you explain why the SHAP values of the two features do not show the same correlation as the inputs? And can I trust the SHAP output or not?

=========================

Correlation between the inputs: 0.91788
Correlation between the SHAP values: -0.661088

The two features are
1) Population in province and
2) Number of families in province.

Model Performance
Train AUC: 0.73
Test AUC: 0.71

Scatter plot
Input scatter plot (x: Number of families in province, y: Population in province)
SHAP values output scatter plot (x: Number of families in province, y: Population in province)


Solution

I guess that by "correlation between SHAP values" you mean the SHAP interaction value.

A SHAP value measures how much a feature value contributes to the model's prediction of the target for an individual observation. SHAP interaction values are likewise defined through the model's predictions, whereas the correlation between features (Pearson, Spearman, etc.) does not involve the target at all. The two can therefore differ in both magnitude and direction.
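To make the contrast concrete, here are the two definitions side by side. The notation is my own assumption, not taken from the question: F is the set of all features, f_x(S) is the model's expected output when only the features in S are known for observation x, and X_i, X_j are the two input columns.

```latex
% SHAP (Shapley) value of feature i for one observation x:
% a weighted average of the change in model output when i joins a subset S.
\phi_i(f, x) = \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
  \Bigl[ f_x\bigl(S \cup \{i\}\bigr) - f_x(S) \Bigr]

% Pearson correlation between two input features:
% it depends only on the inputs, never on the model f or the target.
\rho_{ij} = \frac{\operatorname{cov}(X_i, X_j)}{\sigma_{X_i}\,\sigma_{X_j}}
```

The model f appears in every term of the first expression and nowhere in the second, so nothing forces the two quantities to agree in sign.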

The two features may grow together, yet their contributions to the target variable can point in opposite directions over some ranges of their values.
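Here is a minimal sketch of that effect. It is not your XGBoost setup: I am assuming a hypothetical linear model with made-up weights and synthetic data, because for a linear model (with features treated as independent) the SHAP value of feature i has the closed form w_i * (x_i - mean(x_i)), so no extra library is needed.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def linear_shap(values, weight):
    """SHAP values of one feature in a linear model: w * (x - mean(x))."""
    mean = sum(values) / len(values)
    return [weight * (v - mean) for v in values]


random.seed(0)

# Synthetic (made-up) data: population is roughly 3x the number of families,
# so the two inputs are strongly positively correlated.
families = [random.gauss(100.0, 20.0) for _ in range(1000)]
population = [3.0 * f + random.gauss(0.0, 15.0) for f in families]

# Hypothetical model weights with opposite signs:
# f(x) = W_POP * population + W_FAM * families
W_POP, W_FAM = 0.02, -0.05

shap_population = linear_shap(population, W_POP)
shap_families = linear_shap(families, W_FAM)

input_corr = pearson(population, families)
shap_corr = pearson(shap_population, shap_families)

print(f"correlation between inputs:      {input_corr:.3f}")
print(f"correlation between SHAP values: {shap_corr:.3f}")
```

The inputs come out strongly positively correlated while the SHAP values are negatively correlated, simply because the model uses the two features in opposite directions. A tree model is not linear, so in your case the sign flip will not be exact, but the same mechanism can apply.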

You may want to check the docs and this beautiful work.

Licensed under: CC-BY-SA with attribution