Question

I am trying to determine the root node of a decision tree for the given data:

[image: the dataset]

The target variable, annual income, has been relabeled as low, mid, and high.

I am using the Gini index to measure the impurity of my nodes.

The process I am following is simple:

1 - Calculate the Gini index for the dataset (target is annual income)

gini(annual income)=1-((5/20)^2+(12/20)^2+(3/20)^2) = 0.445

2 - For each variable, calculate the Gini index, then the remainder and the information gain

3 - Choose the variable with the highest information gain

For the remainder I am using this formula:

[image: remainder formula]

except that I use Gini instead of entropy.
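The procedure described above (Gini of the parent, a size-weighted Gini of the children as the remainder, and the difference as the gain) can be sketched roughly as follows. This is only an illustration: the function names gini, remainder and gain are my own, and the child counts are made up because the original data table is not reproduced here.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def remainder(children):
    """Size-weighted average Gini impurity of the child nodes."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


def gain(children):
    """Gini of the parent minus the remainder after the split."""
    # The parent's class counts are the column-wise sums of the children.
    parent = [sum(col) for col in zip(*children)]
    return gini(parent) - remainder(children)


# Hypothetical split (NOT the data from the question): two branches,
# two classes, one list of per-class counts per branch.
hypothetical_children = [[4, 0], [1, 3]]
print(gain(hypothetical_children))
```

With the parent computed this way, the gain should not come out negative, so a negative result usually points to an arithmetic slip somewhere.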

When I calculate the information gain with education as the root node, I get a negative information gain (which is obviously not possible).

My calculation: [image: Gini calculation for the education split]

As you can see, I got a Gini index of 0.532 for the node, so if I compute

Information gain = 0.445 - 0.532 = -0.087 (a negative value)

Can you point out what I am doing wrong?


Solution

I quickly checked your calculation, and you seem to have miscalculated gini(annual income):

gini(annual income)=1-((5/20)^2+(12/20)^2+(3/20)^2) = 0.445

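It actually equals 0.555. You probably forgot the 1-... part, since the sum of squares alone is 0.445. That is larger than 0.532, so the information gain is 0.555 - 0.532 = 0.023, which is positive, and the rest of your calculations may well be fine.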
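If you want to double-check the arithmetic, here is a small sketch. The class counts 5, 12 and 3 and the value 0.532 are taken from your post; the variable names are just mine.

```python
counts = [5, 12, 3]          # class counts out of 20, from the question
total = sum(counts)

# Sum of squared class proportions: this is the 0.445 from the question.
sum_sq = sum((c / total) ** 2 for c in counts)

# The Gini index is one minus that sum.
gini_root = 1.0 - sum_sq

# 0.532 is the remainder for education as reported in the question.
gain_education = gini_root - 0.532

print(round(sum_sq, 3), round(gini_root, 3), round(gain_education, 3))
```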

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange