Question

From the documentation, it appears that DecisionTreeClassifier supports multiclass features:

DecisionTreeClassifier is capable of both binary (where the labels are [-1, 1]) classification and multiclass (where the labels are [0, ..., K-1]) classification.

But it appears that the decision rule in each node is based on "greater than".

I'm trying to build trees with enum features (where the absolute value of a feature carries no meaning; only equal / not equal matters).

Is this supported in scikit-learn decision trees?

My current solution is to split each such feature into a set of binary features, one per possible value, but I'm looking for a cleaner and more efficient solution.
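For reference, a minimal sketch of that workaround (the column names and values are made up for illustration; assumes pandas is available):

    import pandas as pd

    # Hypothetical enum-valued features; only equality between values matters.
    df = pd.DataFrame({"color": ["red", "blue", "red"],
                       "shape": ["box", "box", "ball"]})

    # One binary indicator column per possible value.
    X = pd.get_dummies(df, columns=["color", "shape"])
    print(list(X.columns))
    # ['color_blue', 'color_red', 'shape_ball', 'shape_box']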


Solution

The term multiclass only affects the target variable: for the decision trees and random forests in scikit-learn, it is either categorical with an integer coding (multiclass classification) or continuous (regression).

"Greater-than" rules apply to the input variables independently of the kind of target variable. If you have categorical input variables with a low dimensionality (e.g. less than a couple of tens of possible values) then it might be beneficial to use a one-hot-encoding for those. See:

  • OneHotEncoder if your categories are encoded as integers,
  • DictVectorizer if your categories are encoded as string labels in a list of Python dicts.
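A minimal sketch of both encoders on invented data (note the name of the dense/sparse flag for OneHotEncoder has changed across scikit-learn versions; sparse_output is the recent spelling):

    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction import DictVectorizer

    # Integer-coded categories: one-hot-encode them with OneHotEncoder.
    X_int = [[0, 1], [1, 0], [2, 1]]              # two categorical columns
    enc = OneHotEncoder(sparse_output=False)      # dense output for readability
    print(enc.fit_transform(X_int))

    # String-labelled categories in dicts: use DictVectorizer.
    X_dict = [{"color": "red", "shape": "box"},
              {"color": "blue", "shape": "ball"}]
    vec = DictVectorizer(sparse=False)
    print(vec.fit_transform(X_dict))
    print(vec.get_feature_names_out())            # one column per (key, value)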

If some of the categorical variables have a high cardinality (e.g. thousands of possible values or more), it has been shown experimentally that DecisionTreeClassifier, and stronger models built on it such as RandomForestClassifier, can be trained directly on the raw integer coding, without converting it to a one-hot encoding that would waste memory and inflate the model size.
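A sketch of that raw-integer approach (the OrdinalEncoder step and the toy data are my own illustration, not from the original answer):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import OrdinalEncoder

    # Hypothetical high-cardinality categorical column (e.g. user IDs).
    X_raw = [["user_17"], ["user_42"], ["user_17"], ["user_99"]]
    y = [0, 1, 0, 1]

    # Map each category to an arbitrary integer and train on it directly;
    # no one-hot expansion, so the feature matrix stays a single column.
    X = OrdinalEncoder().fit_transform(X_raw)
    clf = RandomForestClassifier(n_estimators=100).fit(X, y)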

OTHER TIPS

DecisionTreeClassifier is certainly capable of multiclass classification. The "greater than" rule just happens to be what is illustrated at that link; the rule at each node is arrived at through its effect on the information gain or the Gini impurity (see later on that page). Decision tree nodes generally hold binary rules, so they typically take the form of one value being greater than another. The trick is transforming your data so that it has good predictive values to compare.
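To see this concretely, here is a small sketch of my own (not from the answer) showing that on 0/1 one-hot columns the learned threshold sits between the two values, so the "greater than" rule behaves exactly as an equal / not-equal test:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Two one-hot columns, e.g. color_red and color_blue.
    X = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
    y = np.array([0, 1, 0, 1])

    tree = DecisionTreeClassifier().fit(X, y)
    # The root split compares one column against ~0.5, i.e. "is this value set?"
    print(tree.tree_.feature[0], tree.tree_.threshold[0])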

To be clear, multiclass means each datum (say, a document) is to be classified as exactly one of a set of possible classes. This is different from multilabel classification, where a document may be assigned several classes out of the set of possible classes. Most scikit-learn classifiers support multiclass, and the library provides a few meta-estimators to accomplish multilabeling. You can also use probabilities (models with a predict_proba method) or decision-function distances (models with a decision_function method) for multilabeling.
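For example, one such meta-estimator is OneVsRestClassifier, which accepts a binary indicator matrix as a multilabel target (the data below is invented):

    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    # Each sample carries a *set* of labels, not a single class.
    X = [[0.1, 2.0], [1.3, 0.4], [0.2, 1.8]]
    y_sets = [["red", "fast"], ["sport"], ["red", "sport", "fast"]]

    Y = MultiLabelBinarizer().fit_transform(y_sets)   # one 0/1 column per label
    clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
    print(clf.predict(X))                             # indicator matrix back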

If you are saying you need to apply multiple labels to each datum (like ['red', 'sport', 'fast'] to cars), then to use trees/forests you need to create a unique label for each possible combination, and that becomes your [0...K-1] set of classes. However, this implies that there is some predictive correlation in the data for each combination (of color, type, and speed in the cars example). There may be for red or yellow fast sports cars, but it is unlikely for the other three-way combinations: the data may be strongly predictive for those few and very weak for the rest. In that case you may be better off using an SVM such as LinearSVC, and/or wrapping the classifier with OneVsRestClassifier or similar.
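A sketch of that "unique label per combination" idea (sometimes called a label powerset; the data is made up):

    from sklearn.ensemble import RandomForestClassifier

    # Collapse each label set into one combined class.
    y_sets = [("red", "sport", "fast"), ("blue", "sedan", "slow"),
              ("red", "sport", "fast")]
    combos = sorted(set(y_sets))                # every observed combination
    y = [combos.index(s) for s in y_sets]       # classes 0 .. K-1

    X = [[0.9, 1.2], [0.1, 3.4], [1.0, 1.1]]
    clf = RandomForestClassifier().fit(X, y)
    print([combos[k] for k in clf.predict(X)])  # decode back to label sets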

There is a Python package called DecisionTree (https://engineering.purdue.edu/kak/distDT/DecisionTree-2.2.2.html) which I find very helpful.

This is not directly related to your scikit-learn problem, but it may be helpful to others. Also, I always go to pyindex when I am looking for Python tools: https://pypi.python.org/pypi/pyindex

