Question

In a Random Forest context, do I need to setup dummies/OnehotEncoder in a dataset where features/varibles are numerical but refer to some kind of category?

Let's say I have the following variables:

enter image description here

Where Y is the variable I want to predict. X's are features.

I will focus on X1. Its numerical but refers to a specific category (i.e. 1 refers to math, 2 refers to literature and 3 for history). Do I need to apply OnehotEncoder (or dummy approach) for a Random Forest algoritm?

I guess I don't need to do it, but I 'm not sure.

Was it helpful?

Solution

In theory, categories themselves can be handled in decision trees. However, most python implementations will only work with numbers, so you need to convert them. It looks like you already have. OneHot Encoding is one way to do that, but now that you've done that, it isn't necessary. You may try other schemes to turn them into 1 2 3, if there isn't a logical reason to how you have them this way. This is called Categorical Encoding. One popular way is to order them alphabetically and number them. Another way, in order of their frequency. At the end of the day, it won't matter a whole lot because the random forest will bin the number according to its algorithm. It might make minor differences though, so maybe try different ways. But no, you won't need to OneHot encode your categories because they are already numbers.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange
scroll top