Question

Usually, if I have a regression problem and my initial dataset contains categorical variables like:

column 1:  
Math
Science 
Science 
English 

I would convert these non-numerical variables to numerical ones, such as Math: 0, Science: 1, English: 2. However, I recently found a tutorial saying that this approach does not work well because there is no preferred class among them: there is no ordering between the classes, and even if one existed, we could not quantify it.

Can anyone explain this to me? I have always used the first approach.

Solution

Integer encoding works well only if your values have a natural order. Some models rely on the distance between points when learning, and with your encoding a Math student and an English student (0 and 2, a distance of 2) end up farther apart than a Math student and a Science student (0 and 1, a distance of 1). The encoding therefore introduces a bias that is not in the data, so you need another approach.

One well-known method is one-hot encoding, which creates three binary variables, Column1_Math, Column1_Science, and Column1_English, each taking the value 0 or 1. For example, if column 1 is Math, then Column1_Math = 1, Column1_Science = 0, and Column1_English = 0. This way, you avoid biasing your model.
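To make the distance argument concrete, here is a minimal sketch in plain Python that encodes the example column both ways and compares the resulting distances. The helper `one_hot` and the variable names are made up for illustration, and the category order in `categories` is an arbitrary choice.

```python
import math

# The example column from the question.
column1 = ["Math", "Science", "Science", "English"]

# Integer encoding, as described in the question.
label_map = {"Math": 0, "Science": 1, "English": 2}
labels = [label_map[value] for value in column1]

# One-hot encoding: one binary indicator per category.
# The order of the categories is arbitrary (an assumption for this sketch).
categories = ["Math", "Science", "English"]

def one_hot(value):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

encoded = [one_hot(value) for value in column1]

# Distances under integer encoding depend on the arbitrary numbering.
label_math_english = abs(label_map["Math"] - label_map["English"])
label_math_science = abs(label_map["Math"] - label_map["Science"])

# Distances under one-hot encoding are the same for every pair of
# distinct categories (Euclidean distance, Python 3.8+ for math.dist).
onehot_math_english = math.dist(one_hot("Math"), one_hot("English"))
onehot_math_science = math.dist(one_hot("Math"), one_hot("Science"))

print(labels)
print(encoded)
print(label_math_english, label_math_science)
print(onehot_math_english, onehot_math_science)
```

With integer encoding, Math is twice as far from English as it is from Science. With one-hot encoding, both distances are equal, so no pair of subjects is treated as closer than another. In practice you would probably use a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` rather than a hand-written helper.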

I have already explained the other approaches I know of for this issue in this answer, which I highly recommend taking a look at.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange