How to preprocess different types of data (continuous, discrete, categorical) before decision tree learning

https://datascience.stackexchange.com/questions/6721

16-10-2019

Question

I want to use some decision tree learning, such as the random forest classifier.

I have data of different types: continuous, discrete and categorical. How should I preprocess the data in order to get consistent results?

Solution

One of the benefits of decision trees is that ordinal (continuous or discrete) input data does not require any significant preprocessing. In fact, the results should be consistent regardless of any scaling or translational normalization, since the trees can choose equivalent splitting points. The best preprocessing for decision trees is typically whatever is easiest or whatever is best for visualization, as long as it doesn't change the relative order of values within each data dimension.
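The invariance to scaling and translation can be checked with a small sketch. The code below is not taken from any library: `best_split` is a hypothetical helper that searches one feature for the threshold minimizing weighted Gini impurity, and the `ages` data is made up for illustration. It shows that rescaling and shifting the feature changes the threshold but not which points end up on each side of the split.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, left_indices) for the split `value <= threshold`
    that minimizes the weighted Gini impurity of the two sides."""
    distinct = sorted(set(values))
    best = None
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [i for i, v in enumerate(values) if v <= t]
        right = [i for i, v in enumerate(values) if v > t]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(values)
        if best is None or score < best[0]:
            best = (score, t, left)
    return best[1], best[2]


# Made-up data: one numeric feature and a binary label.
ages = [22, 35, 47, 51, 63, 29]
labels = ["no", "no", "yes", "yes", "yes", "no"]

# A strictly increasing transform of the feature: shift, then rescale.
scaled = [(a - 40) / 10 for a in ages]

t_raw, left_raw = best_split(ages, labels)
t_scaled, left_scaled = best_split(scaled, labels)

# The thresholds differ, but the same points fall on the left side.
same_partition = left_raw == left_scaled
print(same_partition)
```

Because only the relative order of values determines which partitions are reachable, any preprocessing that preserves that order leaves the set of possible splits unchanged.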

Categorical inputs, which have no sensible order, are a special case. If your random forest implementation doesn't have a built-in way to deal with categorical input, you should probably use a 1-hot encoding:

If a categorical variable has $n$ categories, you encode each value using $n$ dimensions, one corresponding to each category.
For each data point, if it is in category $k$, the corresponding $k$th dimension is set to 1, while the rest are set to 0.

This 1-hot encoding allows decision trees to perform category equality tests in one split, since inequality splits on non-ordinal data don't make much sense.
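The encoding described above can be sketched in a few lines of plain Python. The `one_hot` function and the `colors` column are hypothetical names and data used only for illustration; in practice a library routine such as pandas `get_dummies` or scikit-learn's `OneHotEncoder` would usually do this job.

```python
def one_hot(values):
    """Encode a list of category labels as 0/1 vectors.

    Returns (categories, rows), where rows[i][k] is 1 exactly when
    values[i] == categories[k], and 0 otherwise.
    """
    categories = sorted(set(values))  # fixes the column order
    index = {c: k for k, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)   # n dimensions, one per category
        row[index[v]] = 1             # set only the matching dimension
        rows.append(row)
    return categories, rows


# Made-up categorical column with n = 3 categories.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)

for color, row in zip(colors, encoded):
    print(color, row)
```

With this representation, a threshold split on a single column (for example, the "green" column at 0.5) separates exactly the points in that category from all the others, which is the equality test mentioned above.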

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange