Question on decision tree in the book Programming Collective Intelligence
16-10-2019
Question
I'm currently studying Chapter 7 ("Modeling with Decision Trees") of the book "Programming Collective Intelligence".
I find the output of the function mdclassify() on p. 157 confusing. The function deals with missing data. The explanation provided is:
In the basic decision tree, everything has an implied weight of 1, meaning that the observations count fully for the probability that an item fits into a certain category. If you are following multiple branches instead, you can give each branch a weight equal to the fraction of all the other rows that are on that side.
From what I understand, an instance is then split between branches.
Hence, I simply don't understand how we can obtain:
{'None': 0.125, 'Premium': 2.25, 'Basic': 0.125}
since 0.125 + 2.25 + 0.125 = 2.5, which is neither 1 nor even an integer. How was the new observation split?
The code is here:
https://github.com/arthur-e/Programming-Collective-Intelligence/blob/master/chapter7/treepredict.py
Using the original dataset, I obtain the tree shown here: (tree image not shown)
Can anyone explain precisely what the numbers mean and how exactly they were obtained?
PS: The first example in the book is wrong, as described on its errata page, but an explanation of just the second example (mentioned above) would be nice.
Solution
There are four features:
- referer,
- location,
- FAQ,
- pages.
In your case, you're trying to classify an instance where FAQ and pages are unknown: mdclassify(['google','France',None,None], tree).
Since the first known attribute is google, in your decision tree you're only interested in the edge that comes out of the google node on the right-hand side.
There are five instances on that side: three labeled Premium, one labeled Basic, and one labeled None.
The instances labeled Basic and None split on the FAQ attribute. There are two of them, so the weight for each of them becomes 0.5.
Now we split on the pages attribute. There are three instances with a pages value larger than 20, and two with a pages value no larger than 20.
Here's the trick: we already know that the weights for two of these were altered from 1 to 0.5 each. So now we have three instances weighted 1 each and two instances weighted 0.5 each, for a total weight of 4.
Now we can compute the branch weights for the pages attribute:
- pages_larger_than_20 = 3/4
- pages_not_larger_than_20 = 1/4 # the 1 in the numerator is 0.5 + 0.5
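This weight bookkeeping can be checked with a few lines of Python (the variable names here are mine, not the book's):

```python
# Weighted counts on the two sides of the "pages" split. The three
# Premium rows still weigh 1 each; the Basic and None rows were
# already down-weighted to 0.5 each by the earlier FAQ split.
larger_than_20 = 3 * 1.0          # three Premium instances
not_larger_than_20 = 0.5 + 0.5    # Basic and None, half weight each
total = larger_than_20 + not_larger_than_20  # 4.0

w_larger = larger_than_20 / total          # 3/4
w_not_larger = not_larger_than_20 / total  # 1/4
print(w_larger, w_not_larger)  # 0.75 0.25
```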
All weights are assigned. Now we can multiply the weights by the "frequencies" of the instances (remembering that for Basic and None the "frequency" is now 0.5):
- Premium: 3 * 3/4 = 2.25 # there are three Premium instances, each weighted 0.75
- Basic: 0.5 * 1/4 = 0.125 # Basic now counts as 0.5, and the fraction for pages_not_larger_than_20 is 1/4
- None: 0.5 * 1/4 = 0.125 # analogously
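Putting it all together: below is a minimal, self-contained sketch of the mdclassify logic, with a toy subtree that mirrors only the branch discussed above (the Node class, attribute indices, and split values are illustrative guesses from the description, not the book's full treepredict module). It reproduces the numbers in question:

```python
class Node:
    """One decision-tree node (illustrative, not the book's exact class)."""
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col          # index of the attribute tested here
        self.value = value      # threshold / value to compare against
        self.results = results  # leaf only: dict of label -> count
        self.tb = tb            # branch taken when the test is true
        self.fb = fb            # branch taken when the test is false

def mdclassify(observation, tree):
    """Classify an observation that may contain None (missing) values."""
    if tree.results is not None:
        return dict(tree.results)
    v = observation[tree.col]
    if v is None:
        # Missing attribute: follow both branches, then scale each
        # result dict by the fraction of total weight on that side.
        tr = mdclassify(observation, tree.tb)
        fr = mdclassify(observation, tree.fb)
        tcount = sum(tr.values())
        fcount = sum(fr.values())
        tw = tcount / (tcount + fcount)
        fw = fcount / (tcount + fcount)
        result = {}
        for k, cnt in tr.items():
            result[k] = result.get(k, 0) + cnt * tw
        for k, cnt in fr.items():
            result[k] = result.get(k, 0) + cnt * fw
        return result
    # Known attribute: descend exactly one branch.
    if isinstance(v, (int, float)):
        branch = tree.tb if v >= tree.value else tree.fb
    else:
        branch = tree.tb if v == tree.value else tree.fb
    return mdclassify(observation, branch)

# Toy subtree for the right-hand "google" branch: pages larger than 20
# leads to the three Premium rows; otherwise split on FAQ.
subtree = Node(col=3, value=21,
               tb=Node(results={'Premium': 3}),
               fb=Node(col=2, value='yes',
                       tb=Node(results={'Basic': 1}),
                       fb=Node(results={'None': 1})))

print(mdclassify(['google', 'France', None, None], subtree))
# {'Premium': 2.25, 'Basic': 0.125, 'None': 0.125}
```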
That's at least where the numbers come from. I share your doubts about the maximum value of this metric, and whether it should sum to 1, but now that you know where these numbers come from you can think how to normalize them.
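For what it's worth, the result is a dict of weighted counts, not probabilities: its values sum to the total weight that reached the branch (here 2.5), not to 1. If you want probabilities, one option (my suggestion, not something the book's code does) is to divide by the total:

```python
# Normalize mdclassify's weighted counts into probabilities.
result = {'None': 0.125, 'Premium': 2.25, 'Basic': 0.125}
total = sum(result.values())  # 2.5
probs = {k: v / total for k, v in result.items()}
print(probs)  # {'None': 0.05, 'Premium': 0.9, 'Basic': 0.05}
```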