Does XGBoost handle multicollinearity by itself?

https://datascience.stackexchange.com/questions/12554

16-10-2019
|

Pergunta

I'm currently using XGBoost on a data-set with 21 features (selected from list of some 150 features), then one-hot coded them to obtain ~98 features. A few of these 98 features are somewhat redundant, for example: a variable (feature) $A$ also appears as $\frac{B}{A}$ and $\frac{C}{A}$.

My questions are :

How (If?) do Boosted Decision Trees handle multicollinearity?
How would the existence of multicollinearity affect prediction if it is not handled?

From what I understand, the model is learning more than one tree and the final prediction is based on something like a "weighted sum" of the individual predictions. So if this is correct, then Boosted Decision Trees should be able to handle co-dependence between variables.

Also, on a related note - how does the variable importance object in XGBoost work?

Solução

Decision trees are by nature immune to multi-collinearity. For example, if you have 2 features which are 99% correlated, when deciding upon a split the tree will choose only one of them. Other models such as Logistic regression would use both the features.

Since boosted trees use individual decision trees, they also are unaffected by multi-collinearity. However, its a good practice to remove any redundant features from any dataset used for training, irrespective of the model's algorithm. In your case since you're deriving new features, you could use this approach, evaluate each feature's importance and retain only the best features for your final model.

The importance matrix of an xgboost model is actually a data.table object with the first column listing the names of all the features actually used in the boosted trees. The second column is the Gain metric which implies the relative contribution of the corresponding feature to the model calculated by taking each feature's contribution for each tree in the model. A higher value of this metric when compared to another feature implies it is more important for generating a prediction.

Outras dicas

I was curious about this and made a few tests.

I’ve trained a model on the diamonds dataset, and observed that the variable “x” is the most important to predict whether the price of a diamond is higher than a certain threshold. Then, I’ve added multiple columns highly correlated to x, ran the same model, and observed the same values.

It seems that when the correlation between two columns is 1, xgboost removes the extra column before calculating the model, so the importance is not affected. However, when you add a column that is partially correlated to another, thus with a lower coefficient, the importance of the original variable x is lowered.

For example if I add a variable xy = x + y, the importance of both x and y decrease. Similarly, the importance of x decreases if I add new variables with r=0.4, 0.5 or 0.6, although just by a bit.

I think that collinearity is not a problem for boosting when you calculate the accuracy of the model, because the decision tree doesn’t care which one of the variables is used. However it might affect the importance of the variables, because removing one of the two correlated variables doesn't have a big impact on the accuracy of the model, given that the other contains similar information.

library(tidyverse)
library(xgboost)

evaluate_model = function(dataset) {
    print("Correlation matrix")
    dataset %>% select(-cut, -color, -clarity, -price) %>% cor %>% print

    print("running model")
    diamond.model = xgboost(
        data=dataset %>% select(-cut, -color, -clarity, -price) %>% as.matrix, 
        label=dataset$price > 400, 
        max.depth=15, nrounds=30, nthread=2, objective = "binary:logistic",
        verbose=F
        )

    print("Importance matrix")
    importance_matrix <- xgb.importance(model = diamond.model)
    importance_matrix %>% print
    xgb.plot.importance(importance_matrix)
    }

> diamonds %>% head
carat   cut color   clarity depth   table   price   x   y   z
0.23    Ideal   E   SI2 61.5    55  326 3.95    3.98    2.43
0.21    Premium E   SI1 59.8    61  326 3.89    3.84    2.31
0.23    Good    E   VS1 56.9    65  327 4.05    4.07    2.31
0.29    Premium I   VS2 62.4    58  334 4.20    4.23    2.63
0.31    Good    J   SI2 63.3    58  335 4.34    4.35    2.75
0.24    Very Good   J   VVS2    62.8    57  336 3.94    3.96    2.48

Evaluate a model on the diamonds data

We predict whether the price is higher than 400, given all numeric variables available (carat, depth, table, x, y, x)

Note that x is the most important variable, with an importance gain score of 0.375954.

evaluate_model(diamonds)
    [1] "Correlation matrix"
               carat       depth      table           x           y          z
    carat 1.00000000  0.02822431  0.1816175  0.97509423  0.95172220 0.95338738
    depth 0.02822431  1.00000000 -0.2957785 -0.02528925 -0.02934067 0.09492388
    table 0.18161755 -0.29577852  1.0000000  0.19534428  0.18376015 0.15092869
    x     0.97509423 -0.02528925  0.1953443  1.00000000  0.97470148 0.97077180
    y     0.95172220 -0.02934067  0.1837601  0.97470148  1.00000000 0.95200572
    z     0.95338738  0.09492388  0.1509287  0.97077180  0.95200572 1.00000000
    [1] "running model"
    [1] "Importance matrix"
       Feature       Gain      Cover  Frequency
    1:       x 0.37595419 0.54788335 0.19607102
    2:   carat 0.19699839 0.18015576 0.04873442
    3:   depth 0.15358261 0.08780079 0.27767284
    4:       y 0.11645929 0.06527969 0.18813751
    5:   table 0.09447853 0.05037063 0.17151492
    6:       z 0.06252699 0.06850978 0.11786929

Model trained on Diamonds, adding a variable with r=1 to x

Here we add a new column, which however doesn't add any new information, as it is perfectly correlated to x.

Note that this new variable is not present in the output. It seems that xgboost automatically removes perfectly correlated variables before starting the calculation. The importance gain of x is the same, 0.3759.

diamonds_xx = diamonds %>%
    mutate(xx = x + runif(1, -1, 1))
evaluate_model(diamonds_xx)
[1] "Correlation matrix"
           carat       depth      table           x           y          z
carat 1.00000000  0.02822431  0.1816175  0.97509423  0.95172220 0.95338738
depth 0.02822431  1.00000000 -0.2957785 -0.02528925 -0.02934067 0.09492388
table 0.18161755 -0.29577852  1.0000000  0.19534428  0.18376015 0.15092869
x     0.97509423 -0.02528925  0.1953443  1.00000000  0.97470148 0.97077180
y     0.95172220 -0.02934067  0.1837601  0.97470148  1.00000000 0.95200572
z     0.95338738  0.09492388  0.1509287  0.97077180  0.95200572 1.00000000
xx    0.97509423 -0.02528925  0.1953443  1.00000000  0.97470148 0.97077180
               xx
carat  0.97509423
depth -0.02528925
table  0.19534428
x      1.00000000
y      0.97470148
z      0.97077180
xx     1.00000000
[1] "running model"
[1] "Importance matrix"
   Feature       Gain      Cover  Frequency
1:       x 0.37595419 0.54788335 0.19607102
2:   carat 0.19699839 0.18015576 0.04873442
3:   depth 0.15358261 0.08780079 0.27767284
4:       y 0.11645929 0.06527969 0.18813751
5:   table 0.09447853 0.05037063 0.17151492
6:       z 0.06252699 0.06850978 0.11786929

Model trained on Diamonds, adding a column for x + y

We add a new column xy = x + y. This is partially correlated to both x and y.

Note that the importance of x and y is slightly reduced, going from 0.3759 to 0.3592 for x, and from 0.116 to 0.079 for y.

diamonds_xy = diamonds %>%
    mutate(xy=x+y)
evaluate_model(diamonds_xy)

[1] "Correlation matrix"
           carat       depth      table           x           y          z
carat 1.00000000  0.02822431  0.1816175  0.97509423  0.95172220 0.95338738
depth 0.02822431  1.00000000 -0.2957785 -0.02528925 -0.02934067 0.09492388
table 0.18161755 -0.29577852  1.0000000  0.19534428  0.18376015 0.15092869
x     0.97509423 -0.02528925  0.1953443  1.00000000  0.97470148 0.97077180
y     0.95172220 -0.02934067  0.1837601  0.97470148  1.00000000 0.95200572
z     0.95338738  0.09492388  0.1509287  0.97077180  0.95200572 1.00000000
xy    0.96945349 -0.02750770  0.1907100  0.99354016  0.99376929 0.96744200
              xy
carat  0.9694535
depth -0.0275077
table  0.1907100
x      0.9935402
y      0.9937693
z      0.9674420
xy     1.0000000
[1] "running model"
[1] "Importance matrix"
   Feature       Gain      Cover  Frequency
1:       x 0.35927767 0.52924339 0.15952849
2:   carat 0.17881931 0.18472506 0.04793713
3:   depth 0.14353540 0.07482622 0.24990177
4:   table 0.09202059 0.04714548 0.16267191
5:      xy 0.08203819 0.04706267 0.13555992
6:       y 0.07956856 0.05284980 0.13595285
7:       z 0.06474029 0.06414738 0.10844794

Model trained on Diamonds data, modified adding redundant columns

We add three new columns that are correlated to x (r = 0.4, 0.5 and 0.6) and see what happens.

Note that the importance of x gets reduced, dropping from 0.3759 to 0.279.

#' given a vector of values (e.g. diamonds$x), calculate three new vectors correlated to it
#' 
#' Source: https://stat.ethz.ch/pipermail/r-help/2007-April/128938.html
calculate_correlated_vars = function(x1) {

    # create the initial x variable
    #x1 <- diamonds$x

    # x2, x3, and x4 in a matrix, these will be modified to meet the criteria
    x234 <- scale(matrix( rnorm(nrow(diamonds) * 3), ncol=3 ))

    # put all into 1 matrix for simplicity
    x1234 <- cbind(scale(x1),x234)

    # find the current correlation matrix
    c1 <- var(x1234)

    # cholesky decomposition to get independence
    chol1 <- solve(chol(c1))

    newx <-  x1234 %*% chol1 

    # check that we have independence and x1 unchanged
    zapsmall(cor(newx))
    all.equal( x1234[,1], newx[,1] )

    # create new correlation structure (zeros can be replaced with other r vals)
    newc <- matrix( 
    c(1  , 0.4, 0.5, 0.6, 
      0.4, 1  , 0  , 0  ,
      0.5, 0  , 1  , 0  ,
      0.6, 0  , 0  , 1  ), ncol=4 )

    # check that it is positive definite
    eigen(newc)

    chol2 <- chol(newc)

    finalx <- newx %*% chol2 * sd(x1) + mean(x1)

    # verify success
    mean(x1)
    colMeans(finalx)

    sd(x1)
    apply(finalx, 2, sd)

    zapsmall(cor(finalx))
    #pairs(finalx)

    all.equal(x1, finalx[,1])
    finalx
}
finalx = calculate_correlated_vars(diamonds$x)
diamonds_cor = diamonds
diamonds_cor$x5 = finalx[,2]
diamonds_cor$x6 = finalx[,3]
diamonds_cor$x7 = finalx[,4]
evaluate_model(diamonds_cor)
[1] "Correlation matrix"
           carat        depth       table           x           y          z
carat 1.00000000  0.028224314  0.18161755  0.97509423  0.95172220 0.95338738
depth 0.02822431  1.000000000 -0.29577852 -0.02528925 -0.02934067 0.09492388
table 0.18161755 -0.295778522  1.00000000  0.19534428  0.18376015 0.15092869
x     0.97509423 -0.025289247  0.19534428  1.00000000  0.97470148 0.97077180
y     0.95172220 -0.029340671  0.18376015  0.97470148  1.00000000 0.95200572
z     0.95338738  0.094923882  0.15092869  0.97077180  0.95200572 1.00000000
x5    0.39031255 -0.007507604  0.07338484  0.40000000  0.38959178 0.38734145
x6    0.48879000 -0.016481580  0.09931705  0.50000000  0.48835896 0.48487442
x7    0.58412252 -0.013772440  0.11822089  0.60000000  0.58408881 0.58297414
                 x5            x6            x7
carat  3.903125e-01  4.887900e-01  5.841225e-01
depth -7.507604e-03 -1.648158e-02 -1.377244e-02
table  7.338484e-02  9.931705e-02  1.182209e-01
x      4.000000e-01  5.000000e-01  6.000000e-01
y      3.895918e-01  4.883590e-01  5.840888e-01
z      3.873415e-01  4.848744e-01  5.829741e-01
x5     1.000000e+00  5.925447e-17  8.529781e-17
x6     5.925447e-17  1.000000e+00  6.683397e-17
x7     8.529781e-17  6.683397e-17  1.000000e+00
[1] "running model"
[1] "Importance matrix"
   Feature       Gain      Cover  Frequency
1:       x 0.27947762 0.51343709 0.09748172
2:   carat 0.13556427 0.17401365 0.02680747
3:      x5 0.13369515 0.05267688 0.18155971
4:      x6 0.12968400 0.04804315 0.19821284
5:      x7 0.10600238 0.05148826 0.16450041
6:   depth 0.07087679 0.04485760 0.11251015
7:       y 0.06050565 0.03896716 0.08245329
8:   table 0.04577057 0.03135677 0.07554833
9:       z 0.03842355 0.04515944 0.06092608

There is an answer from Tianqi Chen (2018).

This difference has an impact on a corner case in feature importance analysis: the correlated features. Imagine two features perfectly correlated, feature A and feature B. For one specific tree, if the algorithm needs one of them, it will choose randomly (true in both boosting and Random Forests™).

However, in Random Forests™ this random choice will be done for each tree, because each tree is independent from the others. Therefore, approximatively, depending of your parameters, 50% of the trees will choose feature A and the other 50% will choose feature B. So the importance of the information contained in A and B (which is the same, because they are perfectly correlated) is diluted in A and B. So you won’t easily know this information is important to predict what you want to predict! It is even worse when you have 10 correlated features…

In boosting, when a specific link between feature and outcome have been learned by the algorithm, it will try to not refocus on it (in theory it is what happens, the reality is not always that simple). Therefore, all the importance will be on feature A or on feature B (but not both). You will know that one feature has an important role in the link between the observations and the label. It is still up to you to search for the correlated features to the one detected as important if you need to know all of them.

To summarise, Xgboost does not randomly use the correlated features in each tree, which random forest model suffers from such a situation.

Reference:

Tianqi Chen, Michaël Benesty, Tong He. 2018. “Understand Your Dataset with Xgboost.” https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html#numeric-v.s.-categorical-variables.

A remark on Sandeep's answer: Assuming 2 of your features are highly colinear (say equal 99% of time) Indeed only 1 feature is selected at each split, but for the next split, the xgb can select the other feature. Therefore, the xgb feature ranking will probably rank the 2 colinear features equally. Without some prior knowledge or other feature processing, you have almost no means from this provided ranking to detect that the 2 features are colinear.

Now, as for the relative importance that outputs the xgboost, it should be very similar (or maybe exactly similar) to the sklearn gradient boostined tree ranking. See here for explainations.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a datascience.stackexchange