I'm trying to do some machine learning work that involves a lot of factor-type variables (words, descriptions, times; basically non-numeric data). I usually rely on randomForest, but it doesn't work with factors that have more than 32 levels.

Can anyone suggest some good alternatives?


Solution

Tree methods won't work here, because the number of possible splits increases exponentially with the number of levels. However, with words this is typically addressed by creating an indicator variable for each word (of the description, etc.); that way a split can use one word at a time (yes/no) instead of considering all possible combinations of levels. In general you can always expand levels into indicators (and some models, such as glm, do that implicitly). The same is true in machine learning when handling text with other methods such as SVMs. So the answer may be that you need to rethink your input data structure, not so much the methods. Alternatively, if the levels have some kind of order, you can linearize the variable, so there are only c-1 possible splits.
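The indicator-variable expansion can be sketched as follows (a minimal Python sketch; the function and variable names are illustrative, not from any particular package):

```python
def indicator_features(descriptions, vocabulary):
    """Expand free-text descriptions into 0/1 indicator columns, one per
    vocabulary word, so a tree can split on one word at a time (yes/no)."""
    rows = []
    for text in descriptions:
        words = set(text.lower().split())
        rows.append([1 if w in words else 0 for w in vocabulary])
    return rows

docs = ["red wool sweater", "blue cotton shirt", "red cotton shirt"]
vocab = sorted({w for d in docs for w in d.lower().split()})
X = indicator_features(docs, vocab)
# each row of X is a binary vector over vocab; a split on the "red"
# column separates red items from everything else
```

In R you would get the same effect from `model.matrix` or from packages that build a document-term matrix; the point is that one high-cardinality factor becomes many binary columns.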

Other tips

In general, the best package I've found for situations with lots of factor levels is gbm.

It can handle up to 1024 factor levels.

If there are more than 1024 levels, I usually change the data by keeping the 1023 most frequently occurring factor levels and coding all remaining levels as a single level.
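That frequency-capping step can be sketched like this (a hedged Python sketch; `cap_levels` is a hypothetical helper, not part of gbm):

```python
from collections import Counter

def cap_levels(values, max_levels=1024):
    """Keep the (max_levels - 1) most frequent levels and map everything
    else to a single 'other' level, as described above."""
    keep = {lvl for lvl, _ in Counter(values).most_common(max_levels - 1)}
    return [v if v in keep else "other" for v in values]

colors = ["red"] * 5 + ["blue"] * 3 + ["green"] * 2 + ["mauve", "teal"]
capped = cap_levels(colors, max_levels=4)  # keeps red, blue, green
```

The rare levels collapse into "other", so the variable never exceeds the package's limit while the common levels keep their identity.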

There is nothing wrong in theory with using the randomForest method on categorical variables with more than 32 levels; it's computationally expensive, but the randomForest methodology can handle any number of levels. The standard R randomForest package simply sets 32 as the maximum number of levels for a categorical variable, and thus prohibits the user from running randomForest on anything with more than 32 levels in any categorical variable.

Linearizing the variable is a very good suggestion. I've used the method of ranking the levels, then breaking them up evenly into 32 meta-classes: if there are actually 64 distinct levels, meta-class 1 consists of everything in levels 1 and 2, and so on. The only problem is figuring out a sensible way of doing the ranking; if you're working with, say, words, it's very difficult to know how each word should be ranked against every other word.
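The ranked binning above can be sketched as follows (a Python sketch; the score used for ranking is an assumption you must supply, e.g. the mean response per level):

```python
def make_meta_classes(levels, scores, n_bins=32):
    """Rank levels by a supplied score, then split the ranked list
    evenly into n_bins meta-classes (0 .. n_bins - 1)."""
    ranked = sorted(levels, key=lambda lvl: scores[lvl])
    size = -(-len(ranked) // n_bins)  # ceiling division: levels per bin
    return {lvl: i // size for i, lvl in enumerate(ranked)}

# 64 distinct levels with scores 0..63 -> 32 meta-classes of 2 levels each
levels = [f"L{i}" for i in range(64)]
scores = {lvl: i for i, lvl in enumerate(levels)}
mapping = make_meta_classes(levels, scores, n_bins=32)
```

Adjacent levels in the ranking land in the same meta-class, which is exactly why the ranking has to be meaningful for this to preserve signal.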

A way around this is to build n different prediction sets, where each set contains all instances but only a particular subset of 31 of the levels of each variable that has more than 32 levels. Fit a model on every set, then use the variable-importance measures that come with the package to find the run in which the levels used were most predictive. Once you've uncovered the 31 most predictive levels, fit a new version of RF on all the data, coding those most predictive levels as 1 through 31 and everything else as a single 'other' level. That gives you the maximum 32 levels for the categorical variable while hopefully preserving much of its predictive power.
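The final recoding step can be sketched as below. This is a hedged Python sketch: in practice you would rank levels by the forest's own variable-importance measures, as described above; here a simple per-level association score (deviation of the level's mean target from the overall mean) stands in so the sketch is self-contained.

```python
from collections import defaultdict

def level_scores(values, target):
    """Stand-in for variable importance: how far each level's mean
    target deviates from the overall mean."""
    overall = sum(target) / len(target)
    by_level = defaultdict(list)
    for v, t in zip(values, target):
        by_level[v].append(t)
    return {lvl: abs(sum(ts) / len(ts) - overall)
            for lvl, ts in by_level.items()}

def keep_most_predictive(values, target, keep=31):
    """Keep the `keep` most predictive levels; pool the rest as 'other'."""
    scores = level_scores(values, target)
    top = set(sorted(scores, key=scores.get, reverse=True)[:keep])
    return [v if v in top else "other" for v in values]

values = ["a"] * 10 + ["b"] * 10 + ["c"] * 10
target = [1] * 10 + [0] * 10 + [1, 0] * 5  # 'a', 'b' predictive; 'c' not
recoded = keep_most_predictive(values, target, keep=2)
```

The uninformative level collapses into "other" while the predictive levels survive, which is the outcome the n-subset search is aiming for.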

Good luck!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow