Question

I have the following dataset containing information about countries

 5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,0,0,0,0,1,0,0,1,0,0,
 3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,
 4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,
 ...

The sixth column in each row indicates the main religion of the country: 0 is catholic, 1 is other christian, 2 is muslim, and so on. Most of the other columns describe the country's flag: whether particular colors are present, which symbols it contains, and so on.

The description of the data can be found here. Note that I have removed the string-valued columns, so my data does not match that description exactly.

My problem is that I want to use covariance matrices and Pearson correlation to see whether, for example, the presence of the color red in a flag tells us anything about which religion the country is more likely to have. But since the religion is an enumerated value, I am a bit lost about how to proceed.

Solution

Your problem is that, although your data is numerically ordered, this order is arbitrary. The "distance" between "Muslim" (enum value 2) and "Hindu" (enum value 4) is not really 2.

The most straightforward way of tackling this issue is to convert the enum values to binary indicator vectors (one-hot encoding):

Suppose you have

enum Religion {
    Catholic = 0,
    Protestant,
    Muslim,
    Jewish,
    Hindu,
    /* ... */
    NumOfRel
};

You replace the single enum value with a binary vector of length NumOfRel that is zero everywhere except for a single 1 at the appropriate position:

For a Protestant entry, you'll have the following binary vector:

[ 0 1 0 0 ... ]

For a Jewish entry:

[ 0 0 0 1 0 ... ]

And so on...
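
For example, here is a minimal sketch of such an encoding in Python (the total of 8 religion codes is an assumption; use however many codes your dataset actually has):

    import numpy as np

    NUM_OF_REL = 8  # total number of religion codes (assumed; plays the role of NumOfRel above)

    def one_hot(religion_code):
        """Return a binary indicator vector with a single 1 at religion_code."""
        vec = np.zeros(NUM_OF_REL, dtype=int)
        vec[religion_code] = 1
        return vec

    print(one_hot(1))  # Protestant -> [0 1 0 0 0 0 0 0]
    print(one_hot(3))  # Jewish     -> [0 0 0 1 0 0 0 0]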

This way, the "distance" between different religions is always 1.
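
Once every row is encoded this way, you can compute Pearson correlations between each religion indicator and the flag-color columns directly. A minimal sketch with pandas, assuming the rows are saved as flag.data and that the religion code sits at column index 5 and the "red present" indicator at column index 9 (adjust the indices to your actual layout):

    import pandas as pd

    # Load the comma-separated rows shown in the question (no header line).
    df = pd.read_csv("flag.data", header=None)

    religion = df[5]  # enumerated religion code (0 = catholic, 1 = other christian, ...)
    red = df[9]       # binary indicator: 1 if the flag contains red (assumed index)

    # One-hot encode religion: one 0/1 column per religion code.
    religion_dummies = pd.get_dummies(religion, prefix="religion", dtype=int)

    # Pearson correlation between "flag contains red" and each religion indicator.
    print(religion_dummies.corrwith(red))

A positive coefficient for a given religion column means flags containing red are relatively more common among countries of that religion, and a negative one means they are less common.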

Licensed under: CC-BY-SA with attribution