Question on decision tree in the book Programming Collective Intelligence
16-10-2019
Question
I'm currently studying Chapter 7 ("Modeling with Decision Trees") of the book "Programming Collective Intelligence".
I find the output of the function mdclassify() on p. 157 confusing. The function deals with missing data. The explanation provided is:
In the basic decision tree, everything has an implied weight of 1, meaning that the observations count fully for the probability that an item fits into a certain category. If you are following multiple branches instead, you can give each branch a weight equal to the fraction of all the other rows that are on that side.
From what I understand, an instance is then split between branches.
Hence, I simply don't understand how we can obtain:
{'None': 0.125, 'Premium': 2.25, 'Basic': 0.125}
since 0.125 + 2.25 + 0.125 = 2.5, which is neither 1 nor even an integer. How was the new observation split?
The code is here:
https://github.com/arthur-e/Programming-Collective-Intelligence/blob/master/chapter7/treepredict.py
Using the original dataset, I obtain the tree shown here: (tree image not shown)
Can anyone explain precisely what the numbers mean and how exactly they were obtained?
PS: The first example in the book is wrong, as described on its errata page, but an explanation of just the second example (mentioned above) would be nice.
Solution
There are four features:
- referer,
- location,
- FAQ,
- pages.
In your case, you're trying to classify an instance where FAQ and pages are unknown: mdclassify(['google','France',None,None], tree).
Since the first known attribute is google, in your decision tree you're only interested in the edge that comes out of the google node on the right-hand side.
There are five instances on that side: three labeled Premium, one labeled Basic, and one labeled None.
The instances labeled Basic and None split on the FAQ attribute. There are two of them, so the weight for each of them becomes 0.5.
Now we split on the pages attribute. There are three instances with a pages value larger than 20, and two with a pages value no larger than 20.
Here's the trick: we already know that the weights for two of these were altered from 1 to 0.5 each. So now we have three instances weighted 1 each and two instances weighted 0.5 each, for a total weight of 4.
Now we can compute the branch weights for the pages attribute:
- pages_larger_than_20 = 3/4
- pages_not_larger_than_20 = 1/4 # the 1 in the numerator is 0.5 + 0.5
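This weight bookkeeping can be checked with a few lines of Python (the variable names here are mine, not the book's):

```python
# Weighted counts on the two sides of the "pages" split. The three
# Premium rows still weigh 1 each; the Basic and None rows were
# already down-weighted to 0.5 each by the earlier FAQ split.
larger_than_20 = 3 * 1.0          # three Premium instances
not_larger_than_20 = 0.5 + 0.5    # Basic and None, half weight each
total = larger_than_20 + not_larger_than_20  # 4.0

w_larger = larger_than_20 / total          # 3/4
w_not_larger = not_larger_than_20 / total  # 1/4
print(w_larger, w_not_larger)  # 0.75 0.25
```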
All weights are assigned. Now we can multiply the weights by the "frequencies" of the instances (remembering that for Basic and None the "frequency" is now 0.5):
- Premium: 3 * 3/4 = 2.25 # there are three Premium instances, each weighted 0.75
- Basic: 0.5 * 1/4 = 0.125 # Basic now counts as 0.5, and the fraction for pages_not_larger_than_20 is 1/4
- None: 0.5 * 1/4 = 0.125 # analogously
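Putting it all together: below is a minimal, self-contained sketch of the mdclassify logic, with a toy subtree that mirrors only the branch discussed above (the Node class, attribute indices, and split values are illustrative guesses from the description, not the book's full treepredict module). It reproduces the numbers in question:

```python
class Node:
    """One decision-tree node (illustrative, not the book's exact class)."""
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col          # index of the attribute tested here
        self.value = value      # threshold / value to compare against
        self.results = results  # leaf only: dict of label -> count
        self.tb = tb            # branch taken when the test is true
        self.fb = fb            # branch taken when the test is false

def mdclassify(observation, tree):
    """Classify an observation that may contain None (missing) values."""
    if tree.results is not None:
        return dict(tree.results)
    v = observation[tree.col]
    if v is None:
        # Missing attribute: follow both branches, then scale each
        # result dict by the fraction of total weight on that side.
        tr = mdclassify(observation, tree.tb)
        fr = mdclassify(observation, tree.fb)
        tcount = sum(tr.values())
        fcount = sum(fr.values())
        tw = tcount / (tcount + fcount)
        fw = fcount / (tcount + fcount)
        result = {}
        for k, cnt in tr.items():
            result[k] = result.get(k, 0) + cnt * tw
        for k, cnt in fr.items():
            result[k] = result.get(k, 0) + cnt * fw
        return result
    # Known attribute: descend exactly one branch.
    if isinstance(v, (int, float)):
        branch = tree.tb if v >= tree.value else tree.fb
    else:
        branch = tree.tb if v == tree.value else tree.fb
    return mdclassify(observation, branch)

# Toy subtree for the right-hand "google" branch: pages larger than 20
# leads to the three Premium rows; otherwise split on FAQ.
subtree = Node(col=3, value=21,
               tb=Node(results={'Premium': 3}),
               fb=Node(col=2, value='yes',
                       tb=Node(results={'Basic': 1}),
                       fb=Node(results={'None': 1})))

print(mdclassify(['google', 'France', None, None], subtree))
# {'Premium': 2.25, 'Basic': 0.125, 'None': 0.125}
```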
That's at least where the numbers come from. I share your doubts about the maximum value of this metric, and whether it should sum to 1, but now that you know where these numbers come from you can think how to normalize them.
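For what it's worth, the result is a dict of weighted counts, not probabilities: its values sum to the total weight that reached the branch (here 2.5), not to 1. If you want probabilities, one option (my suggestion, not something the book's code does) is to divide by the total:

```python
# Normalize mdclassify's weighted counts into probabilities.
result = {'None': 0.125, 'Premium': 2.25, 'Basic': 0.125}
total = sum(result.values())  # 2.5
probs = {k: v / total for k, v in result.items()}
print(probs)  # {'None': 0.05, 'Premium': 0.9, 'Basic': 0.05}
```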