Question

I've just started to use scikit-learn after years of data mining with SAS/SPSS products. I'm amazed by the capabilities of scikit-learn and pandas, but there is one thing I can't figure out by myself. Let's assume my training data is built up of integers, some of which encode categorical values. Is there any way to control how the variables are interpreted by a tree or ensemble-tree algorithm (e.g. ExtraTreesClassifier)? Is the proper way to change the variable type from int to object, or is there a common trick I might learn?

Thanks, dealah


Solution

For low-cardinality categorical features it might be appropriate to use a one-hot encoding feature expansion. Have a look at scikit-learn's OneHotEncoder (or pandas' get_dummies).
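A minimal sketch of one-hot encoding an integer-coded categorical column with pandas (the toy data and column names here are purely illustrative):

```python
import pandas as pd

# "color" is an integer-coded categorical feature (e.g. 0=red, 1=green, 2=blue);
# "size" is a genuinely continuous feature that should be left as-is.
df = pd.DataFrame({"color": [0, 1, 2, 1],
                   "size": [5.0, 3.2, 1.1, 4.8]})

# Expand only the categorical column into one indicator column per category.
df_encoded = pd.get_dummies(df, columns=["color"], prefix="color")
print(df_encoded.columns.tolist())
# "size" stays a single column; "color" becomes color_0, color_1, color_2
```

The same expansion can be done with scikit-learn's OneHotEncoder inside a pipeline, which is preferable when you need to apply the identical encoding to a held-out test set.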

For high-cardinality categorical features, you can keep the integer encoding for ExtraTreesClassifier. Even though the algorithm will treat them as ordinary continuous variables (splitting on thresholds such as "category < 37.5"), in practice this does not seem to hurt predictive accuracy much.
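A sketch of that approach, feeding integer-coded categories directly to the forest (the synthetic data and target rule here are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)

# Three high-cardinality "categorical" features kept as plain integer codes,
# cast to float since scikit-learn works on floating point input.
X = rng.randint(0, 100, size=(200, 3)).astype(np.float64)
# Toy target that depends on the first feature.
y = (X[:, 0] > 50).astype(int)

# The trees treat the integer codes as continuous values and split on thresholds.
clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```

With enough trees the forest can still isolate individual categories through successive threshold splits, which is why the ordinal treatment often works well enough.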

Edit: in any case, scikit-learn expects a homogeneous floating point encoding for all input features. The object dtype is never a valid input type.
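To make that concrete, here is a small sketch of extracting a float array from an integer DataFrame before fitting (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0, 1, 0]})

# Explicitly convert to a homogeneous float64 array; scikit-learn would
# perform an equivalent conversion internally on numeric input, but an
# object-dtype array would be rejected.
X = df.values.astype(np.float64)
print(X.dtype)
```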

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow