Question

I am using xgboost and have a categorical, unordered feature with 25 levels, so when I apply one-hot encoding I get 25 columns. This introduces a lot of sparsity. Even more unusual, my feature importance report shows 5 of these one-hot encoded columns in the top 10, with one of them appearing at the top.

I tried to see if there was a difference in the percentage of these categories between my binary classes (1, 0), but there isn't, so I am a little perplexed as to why it is assigning such a high feature importance to them.

I have read online that if we have a categorical variable with q levels, the tree has to choose from ((2^q/2)-1) splits. For a dummy variable, there is only one possible split, and this induces sparsity.

I am not sure I understand this. Say I have a column called color: red, green, blue, yellow, and I implement one-hot encoding; is the number of splits that happen then 2^4/2 - 1 = 3? If this increases, as with e.g. 2^25/2 - 1, do more splits mean the tree is more likely to find a 'good split' for the data at hand and lead to overfitting? But what I don't understand is how this splitting changes with dummy variables. Does that equation hold or not for one-hot encoded variables?

Am I interpreting this correctly?

Source: The Elements of Statistical Learning (screenshot of the relevant passage attached as an image).

https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769#:~:text=For%20every%20tree%2Dbased%20algorithm,a%20feature%20and%20a%20value.&text=The%20trees%20generally%20tend%20to,values%20(0%20or%201).


Solution

I have read online that if we have a categorical variable with q levels, the tree has to choose from ((2^q/2)-1) splits. For a dummy variable, there is only one possible split, and this induces sparsity.

I am not sure I understand this. Say I have a column called color: red, green, blue, yellow, and I implement one-hot encoding; is the number of splits that happen then 2^4/2 - 1 = 3?...

You have the order of operations wrong there (probably because the "((2^q/2)-1)" you quote above is misleading, but compare to the ESL quote in your image): it's $$ \frac{2^4}{2}-1 = 2^{4-1}-1 = 7$$ possible splits, namely:
(red green blue) vs (yellow)
(red green yellow) vs (blue)
(red blue yellow) vs (green)
(green blue yellow) vs (red)
(red green) vs (blue yellow)
(red blue) vs (green yellow)
(red yellow) vs (green blue)
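
If it helps to see where the $2^{q-1}-1$ count comes from, here is a minimal sketch (standard library only; the level names are just the ones from your example) that enumerates those splits for the four-color case:

```python
# Enumerate the 2^(q-1) - 1 possible binary splits of a q-level categorical
# feature, here q = 4 colors.
from itertools import combinations

levels = ["red", "green", "blue", "yellow"]
q = len(levels)

splits = set()
for size in range(1, q):                       # size of the "left" group
    for left in combinations(levels, size):
        right = tuple(l for l in levels if l not in left)
        # (left, right) and (right, left) describe the same split, so store a
        # canonical (sorted) form to avoid double counting
        splits.add(tuple(sorted([left, right])))

for left, right in sorted(splits):
    print(left, "vs", right)

print("number of splits:", len(splits))        # 2**(q-1) - 1 = 7
```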

if this increases as i have e.g. 2^25/2 -1, more splits means the tree is more likely to find a 'good split' for data at hand and lead to overfitting?...

It is certainly true that there are more possible splits, and that increases the model's capacity and hence perhaps the risk of overfitting. This is particularly worrisome if some of the levels are quite rare, and less so if you have a lot of data.
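
To put a number on it for your q = 25 case (just the arithmetic from the formula above, shown as a sketch):

```python
# Number of candidate splits when the tree treats the 25-level feature as a
# single (unencoded) categorical: every way of grouping the levels into two
# non-empty sets.
q = 25
print(2 ** (q - 1) - 1)   # 16777215 candidate splits
```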

But what I don't understand is how this splitting changes with dummy variables. Does that equation hold or not for one-hot encoded variables?

No, when you one-hot encode such a feature, the tree now must split on only one of those dummy variables (at a time). So, considering all the $q$ levels' new indicator variables, you have exactly $q$ splits to consider. In your example, they are the first four splits given above:
is_yellow=0 vs is_yellow=1
is_blue=0 vs is_blue=1
is_green=0 vs is_green=1
is_red=0 vs is_red=1
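
For concreteness, here is a small sketch (using pandas, with a made-up toy column) showing that one-hot encoding produces exactly $q$ indicator columns, each of which only offers the single 0-vs-1 split listed above:

```python
# A toy illustration: one-hot encoding a q-level color column with pandas.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "yellow", "red", "blue"]})
dummies = pd.get_dummies(df["color"], prefix="is", dtype=int)

print(list(dummies.columns))
# ['is_blue', 'is_green', 'is_red', 'is_yellow']  -> q = 4 candidate splits

for col in dummies.columns:
    # each dummy only takes the values 0 and 1, so the only split a tree can
    # make on it is "this level vs. everything else"
    print(col, sorted(dummies[col].unique()))
```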

Licensed under: CC-BY-SA with attribution