Question

I'm a little confused by this example I've been following online. Please correct me if anything is wrong before I get to my question! I know Bayes' theorem is this:

P(A|B) = P(B|A) * P(A)
         -------------
              P(B)

In the example I'm looking at, classifying is being done on text documents. The text documents are all either "terrorism" or "entertainment", so:

Prior probability for either, i.e. P(A) = 0.5

There are six documents with word frequencies like so:

[table of word frequencies for the six documents]

The example goes on to break down the frequency of these words in relation to each class, applying Laplace estimation:

[table of Laplace-smoothed word probabilities for each class]

So to my understanding each of these numbers represents the P(B|A), i.e. the probability of that word appearing given a particular class (either terrorism or entertainment).

Now a new document arrives, with this breakdown:

[table of word counts in the new document]

The example calculates the probability of this new text document relating to terrorism by doing this:

P(Terrorism | W) = P(Terrorism) x P(kill | Terrorism) x P(bomb | Terrorism) x P(kidnap | Terrorism) x P(music | Terrorism) x P(movie | Terrorism) x P(TV | Terrorism)

which works out as:

0.5 x 0.2380 x 0.1904 x 0.3333 x 0.0476 x 0.0952 x 0.0952

Again, up to now I think I'm following. The P(Terrorism | W) is P(A|B), P(Terrorism) = P(A) = 0.5, and P(B|A) = all the results for "terrorism" in the above table multiplied together.

But to apply it to this new document, the example calculates each of the P(B|A) above to the power of the new frequency. So the above calculation becomes:

0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1
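
Just to check my reading of it, here is that calculation written out as a small Python sketch (the probabilities and the counts are taken straight from the numbers above, hard-coded):

```python
# P(word | Terrorism) values read off the Laplace-smoothed table above
p_word_given_terrorism = {
    "kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
    "music": 0.0476, "movie": 0.0952, "TV": 0.0952,
}

# word counts in the new document
counts = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "TV": 1}

score = 0.5  # prior P(Terrorism)
for word, count in counts.items():
    score *= p_word_given_terrorism[word] ** count

print(score)  # unnormalised P(Terrorism | W)
```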

From there they do a few sums (which I follow) and arrive at the answer. My question is:

Where in the formula does it say to apply the new frequency as a power to the current P(B|A)?

Is this just something statistical I don't know about? Is this universal or just a particular example of how to do it? I'm asking because all the examples I find are slightly different, using slightly different keywords and terms and I'm finding it just a tad confusing!


Solution

First of all, the formula

P(Terrorism | W) = P(Terrorism) x P(kill | Terrorism) x P(bomb | Terrorism) x P(kidnap | Terrorism) x P(music | Terrorism) x P(movie | Terrorism) x P(TV | Terrorism)

isn't quite right. You need to divide that by P(W). But you hint that this is taken care of later when it says that "they do a few sums", so we can move on to your main question.
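
As an aside, that division is just a normalisation over the two classes. A minimal sketch in Python, where score_terrorism and score_entertainment stand for the unnormalised products computed for each class (the entertainment one would use the entertainment column of the table):

```python
def posterior_terrorism(score_terrorism: float, score_entertainment: float) -> float:
    """Divide by P(W), which is the sum of the unnormalised scores of both classes."""
    return score_terrorism / (score_terrorism + score_entertainment)
```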


Traditionally, when doing Naive Bayes text classification, you only look at the presence of words, not their counts. Of course you need the counts to estimate P(word | class) at train time, but at test time P("music" | Terrorism) typically means the probability that the word "music" is present at least once in a Terrorism document.
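
A rough sketch of that traditional presence-only scoring, with made-up placeholder probabilities (not the ones from your tables, since those estimate something different):

```python
# Presence/absence ("Bernoulli") scoring: each vocabulary word contributes
# p if it appears in the document and (1 - p) if it does not, where p is an
# estimate of P(word appears at least once | class).
p_present_given_terrorism = {                  # hypothetical values, for illustration only
    "kill": 0.7, "bomb": 0.6, "kidnap": 0.8,
    "music": 0.1, "movie": 0.2, "TV": 0.2,
}

doc_words = {"kill", "bomb", "kidnap", "TV"}   # words present in the new document

score = 0.5                                    # prior P(Terrorism)
for word, p in p_present_given_terrorism.items():
    score *= p if word in doc_words else (1.0 - p)
```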

It looks like the implementation you are dealing with is trying to take into account P("occurrences of kill" = 2 | Terrorism), which is different from P("at least 1 occurrence of kill" | Terrorism). So why do they end up raising probabilities to powers? Their reasoning seems to be that P("kill" | Terrorism) (which they estimated at train time) represents the probability that an arbitrary word in a Terrorism document is "kill". So, by the same simplifying assumption, the probability that a second arbitrary word in a Terrorism document is "kill" is also P("kill" | Terrorism), and each occurrence contributes one factor of it.
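
A tiny sketch of that reasoning, using the 0.2380 figure from your table:

```python
p_kill = 0.2380            # P("kill" | Terrorism) from the table in the question

# "kill" occurs twice in the new document, so the naive independence
# assumption multiplies that per-occurrence probability in once per occurrence:
via_repeated_product = p_kill * p_kill
via_power = p_kill ** 2

assert abs(via_repeated_product - via_power) < 1e-12   # the power is just shorthand
```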

This leaves a slight problem for the case where a word does not occur in a document. Under this scheme the corresponding probability is raised to the 0th power; in other words, it goes away. That amounts to approximating P("occurrences of music" = 0 | Terrorism) = 1, which is, strictly speaking, false in general, since it would imply P("occurrences of music" > 0 | Terrorism) = 0. But in real-world examples, where you have long documents and thousands or tens of thousands of words, most words don't occur in most documents. So instead of bothering to calculate all of those probabilities accurately (which would be computationally expensive), they are basically swept under the rug, because in the vast majority of cases it wouldn't change the classification outcome anyway.

Note also that, on top of being computationally intensive, the full calculation is numerically unstable: if you multiply thousands or tens of thousands of numbers that are all less than 1, you will underflow and get 0, and even if you work in log space you are still adding tens of thousands of numbers together, which has to be handled carefully from a numerical-stability point of view. So the "raising it to a power" scheme inherently removes unnecessary fluff, decreasing the computational cost, increasing numerical stability, and still yielding nearly identical results.
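
For what it's worth, here is a minimal sketch of that log-space scoring using the same numbers as the question; a real classifier would loop over the entire vocabulary, which is where the numerical care comes in:

```python
import math

p_word_given_terrorism = {
    "kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
    "music": 0.0476, "movie": 0.0952, "TV": 0.0952,
}
counts = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "TV": 1}

log_score = math.log(0.5)              # log of the prior P(Terrorism)
for word, count in counts.items():
    if count > 0:                      # zero-count words drop out, i.e. the power of 0
        log_score += count * math.log(p_word_given_terrorism[word])

print(math.exp(log_score))             # same unnormalised score as multiplying directly
```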


I hope the NSA doesn't think I'm a terrorist for having used the word Terrorism so much in this answer :S

Licensed under: CC-BY-SA with attribution