Question

What is the best approach to including a zero-inflated continuous independent feature (e.g., 90% of the values are zero, 10% are >0) in tree-based models (decision trees, random forests, gradient boosting, etc.)? I am considering the following three options:

  • Option 1: Keep the zero-inflated continuous feature as-is.

  • Option 2: Replace the continuous feature with a binary indicator feature (i.e., 0 for X = 0; 1 for X > 0).

  • Option 3: Include both the continuous feature and the binary indicator feature.

The main justification I have for Option 1 is that the continuous feature can be used in more than one split. I am also aware that Option 3 means including two highly correlated independent features. Will I be losing information if I use Option 2? (A sketch of the three encodings follows.)
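For concreteness, here is a minimal Python sketch of the three encodings on simulated data; the feature name `x`, the 90% zero rate, and the exponential tail are all invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical zero-inflated feature: roughly 90% zeros, 10% positive values.
x = np.where(rng.random(n) < 0.9, 0.0, rng.exponential(scale=5.0, size=n))

df = pd.DataFrame({"x": x})
df["x_pos"] = (df["x"] > 0).astype(int)  # binary indicator of X > 0

X1 = df[["x"]]            # Option 1: the raw continuous feature
X2 = df[["x_pos"]]        # Option 2: the binary indicator only
X3 = df[["x", "x_pos"]]   # Option 3: both (highly correlated)
```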

Update: I found the following answer; however, I am not sure whether it generalizes to all tree-based models.


Solution

Basically, for a categorical random variable, the decision tree tries all potential splits and selects the one that minimizes the MSE. For a continuous random variable, the decision tree likewise searches for the split that minimizes the MSE. Note that categorizing the data leaves fewer potential choices for the split, and is therefore not optimal in accuracy. On the other hand, having fewer splits to consider during training decreases running time.
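To make the quoted point concrete, here is a toy re-implementation of the per-feature split search a regression tree performs. It is a sketch of the idea, not how any particular library implements it:

```python
import numpy as np

def best_split(x, y):
    """Exhaustively score every candidate threshold on feature x and return
    the one that minimizes the pooled squared error of the two child nodes.
    Minimizing total SSE is equivalent to minimizing the weighted MSE."""
    values = np.unique(x)
    thresholds = (values[:-1] + values[1:]) / 2  # midpoints between distinct values
    best_t, best_sse = None, np.inf
    for t in thresholds:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t, best_sse

# Toy usage on made-up data:
x = np.array([0.0, 0.0, 0.0, 1.5, 2.0, 7.0])
y = np.array([0.1, 0.0, 0.2, 0.1, 1.0, 1.2])
print(best_split(x, y))
```

A binary 0/1 feature admits exactly one candidate threshold (0.5), whereas a continuous feature with k distinct values admits k − 1, which is why binarizing shrinks the search space.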

My answer is as follows.

If training time permits, I would go with Option 1. That way, you don't lose any information; the exact value of the feature may be the key. This matters especially if your response is unbalanced: the feature may be a good predictor of your response precisely when it is large. For example, radiation level is zero for most people, small positive values are fine, but large positive values are the key, and we don't know in advance how large a value to look for. A binary indicator discards exactly that threshold.
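A quick simulation can illustrate this; all numbers here, including the response threshold of 8, are invented for the example. When the signal depends on how large the positive values are, a tree given the continuous feature can recover the cut, while one given only the binary indicator cannot:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 5000

# Zero-inflated feature; the (hypothetical) signal lives only in large values.
x = np.where(rng.random(n) < 0.9, 0.0, rng.exponential(scale=5.0, size=n))
y = (x > 8.0).astype(float) + rng.normal(0.0, 0.1, n)  # threshold 8 is made up

X_cont = x.reshape(-1, 1)                      # Option 1: continuous
X_bin = (x > 0).astype(float).reshape(-1, 1)   # Option 2: binary indicator

r2_cont = DecisionTreeRegressor(max_depth=3).fit(X_cont, y).score(X_cont, y)
r2_bin = DecisionTreeRegressor(max_depth=3).fit(X_bin, y).score(X_bin, y)

print(f"R^2 with continuous feature: {r2_cont:.3f}")  # can learn the x > 8 cut
print(f"R^2 with binary indicator:   {r2_bin:.3f}")   # cannot; information lost
```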

Option 2 speeds up training, which may be a benefit.

Option 3 creates unnecessary redundancy: the binary indicator is a deterministic function of the continuous feature, so a tree that already sees the continuous feature can reproduce its one split (at X > 0) on its own.

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange