Question

I want to learn a decision tree for a discrete target attribute with 5 possible values. However, some of the discrete input attributes have high cardinality (thousands of possible string values), and I wonder whether it makes sense to include them. Is there a rule for the maximum cardinality an attribute should have when it is used to train a decision tree?


Solution

There is no maximum cardinality, no. Of course, you could omit values that do not actually appear in the data.

You will have to use an RDF (random decision forest) implementation that handles multi-valued categorical features directly rather than converting them to a series of binary indicator features.

For a categorical feature with N values there are 2^N - 2 non-trivial subsets of values that could define a decision rule, which is too many to consider by a long way. The heuristic I have used is to compute the entropy of the target within each of the N groups you get by dividing the data by categorical feature value. Then order the values by that entropy and evaluate only the N-1 rules you get by considering the proper, non-empty prefixes of that list.
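
The prefix heuristic above could be sketched roughly as follows in plain Python. This is only an illustration, not any library's implementation: the function names `entropy` and `best_prefix_split` are hypothetical, and scoring each candidate rule by the size-weighted entropy of the two resulting groups is an assumption, since the answer does not say how the rules are evaluated.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def best_prefix_split(values, targets):
    """Return (left_values, score) for the best prefix rule, or None.

    `values` holds the categorical feature, `targets` the class labels.
    The score is the size-weighted entropy of the two sides (an assumed
    criterion); lower is better. None is returned when fewer than two
    distinct values are observed, because no split is possible.
    """
    # Group the target labels by observed categorical value.
    by_value = defaultdict(list)
    for v, t in zip(values, targets):
        by_value[v].append(t)

    # Order the values by the entropy of their target labels
    # (ties broken by the value itself so the order is deterministic).
    ordered = sorted(by_value, key=lambda v: (entropy(by_value[v]), v))

    n = len(targets)
    best = None
    # Each proper, non-empty prefix of the ordering is one candidate rule:
    # rows whose value is in the prefix go left, all others go right.
    for k in range(1, len(ordered)):
        left = [t for v in ordered[:k] for t in by_value[v]]
        right = [t for v in ordered[k:] for t in by_value[v]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or score < best[1]:
            best = (set(ordered[:k]), score)
    return best


if __name__ == "__main__":
    feature = ["a", "a", "b", "b", "c", "c"]
    target = ["x", "x", "y", "y", "x", "y"]
    print(best_prefix_split(feature, target))
```

Only values that actually occur in the data enter the ordering, so unused values are omitted automatically, and the number of candidate rules grows linearly with the number of observed values instead of exponentially.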

Licensed under: CC-BY-SA with attribution