Question

I am studying the statistical entropy concept used by the ID3 machine learning algorithm.

For a domain exemplified by the learning set S (that is, the set of examples I use to build a decision tree), the average amount of information needed to classify an object is given by the entropy measure.

So I have the following formula:

Entropy(S) = - Σ p_i * Log2(p_i)

where the sum runs over the classes and p_i is the proportion of examples in S belonging to class i.

So, for example:

If S is a collection of 14 examples with 9 YES and 5 NO examples then I have that:

Entropy(S) = - (9/14)*Log2(9/14) - (5/14)*Log2(5/14) = 0.940
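
As a quick check, this matches a direct computation in Python (my own snippet, not from the original post):

    from math import log2

    # 9 YES and 5 NO examples out of 14
    print(-(9/14) * log2(9/14) - (5/14) * log2(5/14))  # ~ 0.940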

This is simple enough to calculate. My problem is that my book also includes this note:

Notice entropy is 0 if all members of S belong to the same class (the data is perfectly classified). The range of entropy is 0 ("perfectly classified") to 1 ("totally random").

This assertion is confusing me, because when I modify the previous example like this:

If S is a collection of 14 examples with 14 YES and 0 NO examples then I have that:

Entropy(S) = - (14/14)*Log2(14/14) - (0/14)*Log2(0/14) = 0 - infinity

So, in this case, all the objects belong to the same class (YES) and no examples belong to the NO class.

So I would expect the entropy of this set S to be 0, not -infinity.

What am I missing?

Thanks

Andrea


Solution

When calculating entropy, you perform the summation by iterating over the unique classification values that actually occur in the set at the node in question. On each iteration you count how many members of the set have that value and apply the log formula. In your problem case, the only classification value that occurs is YES, so the entropy is zero after a single iteration: -(14/14)*Log2(14/14) = 0. You never iterate on a NO value, because none of the examples have it. Equivalently, the convention is that 0*Log2(0) is taken to be 0 (since x*Log2(x) tends to 0 as x tends to 0), so an empty class contributes nothing to the sum.
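
To make this concrete, here is a minimal sketch of such an entropy function (in Python; the function name and the label-list representation are my own illustration, not code from the original post). Because it sums only over the classification values actually present in the set, the all-YES case never produces a 0/14 term:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        total = len(labels)
        # Counter only holds values that actually occur, so a class with
        # zero members never contributes a 0 * Log2(0) term to the sum.
        return sum(-(n / total) * log2(n / total)
                   for n in Counter(labels).values())

    print(entropy(["YES"] * 9 + ["NO"] * 5))  # ~ 0.940
    print(entropy(["YES"] * 14))              # 0.0: the single term is -(14/14) * Log2(14/14)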

Licensed under: CC-BY-SA with attribution