My response comes from a class I took that used these slides (see slide 20).
The statement there is that there is no easy way to deal with categorical predictors that have a large number of categories. Also, decision trees and random forests will tend to prefer splitting on such predictors, because more levels mean more candidate splits and therefore more chances to reduce impurity by luck alone.
A few recommended solutions:
- Bin your categorical predictor into fewer bins (that are still meaningful to you).
- Order the predictor's levels according to their means (slide 20). This is my Prof's recommendation, and in practice it amounts to using an `ordered` factor in R.
- Be careful about the influence of this categorical predictor. For example, one thing you can do with the `randomForest` package is to set the `mtry` parameter to a lower number. This controls how many variables the algorithm considers at each split. When it's set lower, your categorical predictor will appear as a split candidate less often relative to the other variables. This speeds up estimation and lets the decorrelation built into the random forest method help ensure you don't overfit to your categorical variable.
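As a sketch of the ordering idea above, assuming a numeric response `y` and a factor `x` (both names are just illustrative), one way to reorder a factor's levels by the mean response within each level and mark it as ordered is:

```r
# Reorder a factor's levels by the mean of the response within each level,
# then mark it as ordered so splits respect the ordering.
order_by_mean <- function(x, y) {
  level_means <- tapply(y, x, mean)  # mean response per level
  factor(x, levels = names(sort(level_means)), ordered = TRUE)
}

# Toy data (hypothetical):
x <- factor(c("a", "b", "c", "a", "b", "c"))
y <- c(5, 1, 3, 7, 2, 4)

x_ord <- order_by_mean(x, y)
levels(x_ord)  # levels now sorted by ascending mean response: "b" "c" "a"
```

With the levels ordered this way, a tree only has to consider the k - 1 ordered cut points rather than all 2^(k-1) - 1 subsets of levels.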
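A minimal sketch of lowering `mtry`, on made-up data with one many-level factor (the data and variable names are hypothetical; only the `randomForest(..., mtry = ...)` call is the point):

```r
library(randomForest)

set.seed(1)
# Hypothetical data: one 20-level factor plus two numeric predictors.
n <- 300
dat <- data.frame(
  cat = factor(sample(letters[1:20], n, replace = TRUE)),
  x1  = rnorm(n),
  x2  = rnorm(n)
)
dat$y <- dat$x1 + rnorm(n)  # response depends on x1 only

# mtry = 1 means each split considers a single randomly chosen variable,
# so the many-level factor competes for splits less often than it would
# under the regression default of floor(p / 3).
fit <- randomForest(y ~ cat + x1 + x2, data = dat, mtry = 1, ntree = 200)
```

You would normally tune `mtry` (e.g. with `tuneRF` or cross-validation) rather than fix it at 1; the point is only that smaller values dilute the many-level factor's influence.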
Finally, I'd recommend looking at the MARS or PRIM methods. My professor has some slides on those here. PRIM in particular is known for its low computational requirements.