Question

When I was exploring the Titanic dataset to estimate the probability of a person surviving using a logistic regression model, I realized there are two ways of handling numerical categorical variables:

  1. Use them as is in the dataset, i.e. don't convert them into dummy variables (see the Pclass column in the "Without dummies" screenshot).

  2. Convert the numerical categorical variable into dummy variables (see the "With dummies" screenshot).

I had assumed both would produce the same result, but the model results differ between the two approaches. So I wanted to understand the logic behind each approach and which one should be preferred.
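For reference, the two approaches can be reproduced in pandas along these lines (a minimal sketch; the toy values and the pd.get_dummies call are just one common way to produce the dummy variables and are not taken from the screenshots):

import pandas as pd

# Toy stand-in for the Titanic data; only Pclass and Survived are shown
df = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1],
    "Survived": [1, 0, 1, 0, 1],
})

# Approach 1: use the numeric codes as-is
X_numeric = df[["Pclass"]]

# Approach 2: expand Pclass into dummy columns Pclass_1, Pclass_2, Pclass_3
X_dummies = pd.get_dummies(df["Pclass"], prefix="Pclass")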


Solution

If you don't encode numerical categories with dummy variables, some models will end up being trained to use an ordering of numbers (e.g. 1 < 2 < 3 < 4 < 5 < ...) in their predictions. Whether or not this is desirable or useful depends on the context, and in particular on the meaning of the numerical categories and the model and implementation being used.

In your example, suppose the numerical labels for the Pclass categories represent a "poverty class", with lower numbers corresponding to lower income and higher numbers to higher income. Then the ordering of the integers reflects a meaningful ordering of the categories: "1 < 2 < 3" corresponds to "poverty class 1 is less wealthy than poverty class 2, which is less wealthy than poverty class 3". In this case, using the numerical encoding as it is doesn't appear to do any harm, and may even help some models. For example, a decision tree could learn a rule like

if (Pclass >= 2): prediction = "survive"

whereas if the categories were encoded with dummy variables, this rule would instead look like:

if ((Pclass_2 == True) or (Pclass_3 == True)): prediction = "survive"

For some implementations/models, the first rule may be more likely to be found than the second, be more cost-effective (simply storing the dummy-variable columns takes more memory, to say nothing of the effect the extra features might have on model performance), and generalize better to new data.
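To see this concretely, here is a small sketch with scikit-learn on deliberately made-up data (the numbers are chosen only to force a split of the "Pclass >= 2" flavour; nothing here is fitted to the real Titanic data):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data in which class 1 mostly does not survive and classes 2 and 3 mostly do
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Ordinal encoding: a single threshold on the raw integer is enough
tree_ordinal = DecisionTreeClassifier(max_depth=1, random_state=0)
tree_ordinal.fit(df[["Pclass"]], df["Survived"])
print(export_text(tree_ordinal, feature_names=["Pclass"]))  # learns something like "Pclass <= 1.5"

# Dummy encoding: expressing the same pattern needs the indicator columns
X_dummies = pd.get_dummies(df["Pclass"], prefix="Pclass")
tree_dummies = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_dummies.fit(X_dummies, df["Survived"])
print(export_text(tree_dummies, feature_names=list(X_dummies.columns)))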

On the other hand, suppose the numerical labels for the Pclass category represented, say, the location of the room on the boat in which a person was staying (with 1 representing right next to the iceberg hit, 2 the super fancy ballroom with its own getaway boat, 3 near-ish to the iceberg hit, 4 mid-tier, and so on). Then the ordering of the integers has no meaningful relationship to an ordering of the actual classes (none that I can see, anyway!). If one were to use the numerical categories as they are, many model implementations could learn patterns in a "bad" way. For example, a decision tree could learn that the people near the iceberg hit had poor survival rates and end up producing the rule:

if Pclass <= 3: prediction = "does not survive"

This is "bad" both because it assumes being near the iceberg hit has something to do with the numerical encoding being less than 3 (it may not, for example future data might include many new category labels greater than 3 all of which encode rooms near the iceberg hit), and because the rule seems likely to produce many incorrect predictions for those with Pclass=2. If the categories were encoded with dummy variables, the rule learned might instead be:

if ((Pclass_1 == True) or (Pclass_3 == True)): prediction = "does not survive"

This rule is "better" in this context, as it does not share the two issues the previous rule did.

In general, I suggest the following rule of thumb:

Use dummy variables to encode numerically labeled categories if the natural ordering on the numerical labels has no meaningful interpretation in terms of the categories.

All else being equal, I tend to prefer dummy variables by default when I have the space in memory, and then think about whether it makes sense to use an integer encoding instead. In your situation, what does Pclass represent, and is the ordering of the numerical categories actually meaningful to you? If not, I would recommend using dummy variables.
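As a rough sketch of that default for your logistic model (the column names are assumptions based on the usual Titanic dataset), the one-hot encoding can live inside a scikit-learn pipeline; setting handle_unknown="ignore" on the encoder also guards against the "new category labels in future data" issue mentioned above:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative Titanic-style rows; real code would load the full dataset
df = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1, 2],
    "Age":      [22.0, 35.0, 28.0, 40.0, 19.0, 31.0],
    "Survived": [1, 0, 1, 0, 1, 1],
})

# One-hot encode Pclass, pass the other columns through, then fit the logistic model.
# handle_unknown="ignore" means a class label unseen during training simply gets
# all-zero dummies at prediction time instead of raising an error.
model = Pipeline([
    ("encode", ColumnTransformer(
        [("pclass", OneHotEncoder(handle_unknown="ignore"), ["Pclass"])],
        remainder="passthrough",
    )),
    ("logreg", LogisticRegression()),
])
model.fit(df[["Pclass", "Age"]], df["Survived"])
print(model.predict(pd.DataFrame({"Pclass": [2], "Age": [30.0]})))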

Licensed under: CC-BY-SA with attribution