Question

I want to use a decision-tree-based learner, such as a Random Forest classifier.

My data has features of different types: continuous, discrete and categorical. How should I preprocess the data in order to get consistent results?

Solution

One of the benefits of decision trees is that ordinal (continuous or discrete) input data does not require any significant preprocessing. In fact, the results should be consistent regardless of any scaling or shifting of the values, since the trees can choose equivalent splitting points. The best preprocessing for decision trees is typically whatever is easiest or whatever is best for visualization, as long as it doesn't change the relative order of values within each data dimension.
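To make the invariance claim concrete, here is a minimal sketch of a single-feature split search (a decision stump scored with Gini impurity). The function names and the toy data are made up for illustration and are not taken from any particular library. The idea is that an order-preserving transformation changes the threshold value but not which samples land on each side of the split.

```python
import math


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(xs, ys):
    """Return the indices sent to the left branch by the best threshold split.

    Candidate splits lie between consecutive distinct sorted feature values.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_score, best_left = float("inf"), frozenset()
    for pos in range(1, len(order)):
        if xs[order[pos - 1]] == xs[order[pos]]:
            continue  # no threshold can separate equal values
        left, right = order[:pos], order[pos:]
        # Weighted impurity of the two children (lower is better).
        score = (len(left) * gini([ys[i] for i in left])
                 + len(right) * gini([ys[i] for i in right]))
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


# Hypothetical toy data: one feature, two classes.
xs = [2.0, 3.5, 1.0, 8.0, 7.5, 9.0]
ys = [0, 0, 0, 1, 1, 1]

raw_left = best_split(xs, ys)
scaled_left = best_split([100 * x - 40 for x in xs], ys)  # scale and shift
logged_left = best_split([math.log(x) for x in xs], ys)   # monotonic, non-linear

print(raw_left == scaled_left == logged_left)
```

All three calls select the same partition of the samples, which is why rescaling does not change what the tree learns. Real implementations may break ties or pick thresholds slightly differently, so treat this as an illustration of the principle rather than a guarantee about any specific library.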

Categorical inputs, which have no sensible order, are a special case. If your random forest implementation doesn't have a built-in way to deal with categorical input, you should probably use a one-hot encoding:

  • If a categorical variable has $n$ categories, you encode it using $n$ dimensions, one corresponding to each category.
  • For each data point, if it is in category $k$, the corresponding $k$th dimension is set to 1, while the rest are set to 0.
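The two steps above can be sketched in plain Python as follows. The helper name `one_hot` and the example values are hypothetical; most libraries ship their own encoder, and this version only illustrates the mechanics.

```python
def one_hot(values):
    """Encode categorical values as 0/1 vectors, one dimension per category."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {c: k for k, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)   # n dimensions for n categories
        row[index[v]] = 1             # the k-th dimension marks category k
        rows.append(row)
    return categories, rows


# Hypothetical categorical feature.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)

print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Sorting the categories is only a convenience to keep the column order stable between runs; any fixed order works as long as the same mapping is used for training and prediction data.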

This one-hot encoding allows decision trees to perform a category equality test in a single split, since inequality splits on non-ordinal data don't make much sense.
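As a small illustration of that point, a threshold split at 0.5 on a single indicator column selects exactly the samples of one category. The variable names and data are made up for this sketch.

```python
# Hypothetical categorical feature and the indicator column for "green".
colors = ["red", "green", "blue", "green"]
is_green = [1 if c == "green" else 0 for c in colors]

# An ordinary threshold split on the indicator column.
threshold = 0.5
right = [i for i, v in enumerate(is_green) if v > threshold]
left = [i for i, v in enumerate(is_green) if v <= threshold]

# The same partition expressed as a direct equality test on the category.
equals_green = [i for i, c in enumerate(colors) if c == "green"]

print(right == equals_green)
```

By contrast, mapping the categories to arbitrary integers would generally force the tree to use several threshold splits to isolate one category, and which categories can be grouped together would depend on the arbitrary numbering.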

Licensed under: CC-BY-SA with attribution