Question

I'm a little confused by this example I've been following online. Please correct me if anything is wrong before I get to my question! I know Bayes' theorem is this:

P(A|B) = P(B|A) * P(A)
         -------------
              P(B)

In the example I'm looking at, classifying is being done on text documents. The text documents are all either "terrorism" or "entertainment", so:

Prior probability for either, i.e. P(A) = 0.5

There are six documents with word frequencies like so:

[table of word frequencies for the six documents]

The example goes on to break down the frequency of these words in relation to each class, applying Laplace estimation:

[table of Laplace-smoothed word probabilities for each class]

So to my understanding each of these numbers represents the P(B|A), i.e. the probability of that word appearing given a particular class (either terrorism or entertainment).

Now a new document arrives, with this breakdown:

[table of word counts in the new document]

The example calculates the probability of this new text document relating to terrorism by doing this:

P(Terrorism | W) = P(Terrorism) x P(kill | Terrorism) x P(bomb | Terrorism) x P(kidnap | Terrorism) x P(music | Terrorism) x P(movie | Terrorism) x P(TV | Terrorism)

which works out as:

0.5 x 0.2380 x 0.1904 x 0.3333 x 0.0476 x 0.0952 x 0.0952

Again, up to now I think I'm following. The P(Terrorism | W) is P(A|B), P(Terrorism) = P(A) = 0.5, and P(B|A) = all the results for "terrorism" in the above table multiplied together.

But to apply it to this new document, the example calculates each of the P(B|A) above to the power of the new frequency. So the above calculation becomes:

0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1
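
Just to check my reading of it, here is that calculation written out as a small Python sketch (the probabilities and the counts are taken straight from the numbers above, hard-coded):

```python
# P(word | Terrorism) values read off the Laplace-smoothed table above
p_word_given_terrorism = {
    "kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
    "music": 0.0476, "movie": 0.0952, "TV": 0.0952,
}

# word counts in the new document
counts = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "TV": 1}

score = 0.5  # prior P(Terrorism)
for word, count in counts.items():
    score *= p_word_given_terrorism[word] ** count

print(score)  # unnormalised P(Terrorism | W)
```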

From there they do a few sums (which I follow) and arrive at the answer. My question is:

Where in the formula does it say to apply the new frequency as a power to the current P(B|A)?

Is this just something statistical I don't know about? Is this universal or just a particular example of how to do it? I'm asking because all the examples I find are slightly different, using slightly different keywords and terms and I'm finding it just a tad confusing!


Solution

First of all, the formula

P(Terrorism | W) = P(Terrorism) x P(kill | Terrorism) x P(bomb | Terrorism) x P(kidnap | Terrorism) x P(music | Terrorism) x P(movie | Terrorism) x P(TV | Terrorism)

isn't quite right. You need to divide that by P(W). But you hint that this is taken care of later when it says that "they do a few sums", so we can move on to your main question.
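
As an aside, that division is just a normalisation over the two classes. A minimal sketch in Python, where score_terrorism and score_entertainment stand for the unnormalised products computed for each class (the entertainment one would use the entertainment column of the table):

```python
def posterior_terrorism(score_terrorism: float, score_entertainment: float) -> float:
    """Divide by P(W), which is the sum of the unnormalised scores of both classes."""
    return score_terrorism / (score_terrorism + score_entertainment)
```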


Traditionally, when doing Naive Bayes text classification, you only look at the presence of words, not their counts. Of course you need the counts to estimate P(word | class) at train time, but at test time P("music" | Terrorism) typically means the probability that the word "music" is present at least once in a Terrorism document.
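
A rough sketch of that traditional presence-only scoring, with made-up placeholder probabilities (not the ones from your tables, since those estimate something different):

```python
# Presence/absence ("Bernoulli") scoring: each vocabulary word contributes
# p if it appears in the document and (1 - p) if it does not, where p is an
# estimate of P(word appears at least once | class).
p_present_given_terrorism = {                  # hypothetical values, for illustration only
    "kill": 0.7, "bomb": 0.6, "kidnap": 0.8,
    "music": 0.1, "movie": 0.2, "TV": 0.2,
}

doc_words = {"kill", "bomb", "kidnap", "TV"}   # words present in the new document

score = 0.5                                    # prior P(Terrorism)
for word, p in p_present_given_terrorism.items():
    score *= p if word in doc_words else (1.0 - p)
```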

It looks like the implementation you are dealing with is trying to take into account P("occurrences of kill" = 2 | Terrorism), which is different from P("at least 1 occurrence of kill" | Terrorism). So why do they end up raising probabilities to powers? Their reasoning seems to be that P("kill" | Terrorism) (which they estimated at train time) represents the probability that an arbitrary word in a Terrorism document is "kill". So, by the same simplifying assumption, the probability that a second arbitrary word in a Terrorism document is "kill" is also P("kill" | Terrorism), and each occurrence contributes one factor of it.
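
A tiny sketch of that reasoning, using the 0.2380 figure from your table:

```python
p_kill = 0.2380            # P("kill" | Terrorism) from the table in the question

# "kill" occurs twice in the new document, so the naive independence
# assumption multiplies that per-occurrence probability in once per occurrence:
via_repeated_product = p_kill * p_kill
via_power = p_kill ** 2

assert abs(via_repeated_product - via_power) < 1e-12   # the power is just shorthand
```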

This leaves a slight problem for the case where a word does not occur in a document. Under this scheme the corresponding probability is raised to the 0th power; in other words, it goes away. That amounts to approximating P("occurrences of music" = 0 | Terrorism) = 1, which is, strictly speaking, false in general, since it would imply P("occurrences of music" > 0 | Terrorism) = 0. But in real-world examples, where you have long documents and thousands or tens of thousands of words, most words don't occur in most documents. So instead of bothering to calculate all of those probabilities accurately (which would be computationally expensive), they are basically swept under the rug, because in the vast majority of cases it wouldn't change the classification outcome anyway.

Note also that, on top of being computationally intensive, the full calculation is numerically unstable: if you multiply thousands or tens of thousands of numbers that are all less than 1, you will underflow and get 0, and even if you work in log space you are still adding tens of thousands of numbers together, which has to be handled carefully from a numerical-stability point of view. So the "raising it to a power" scheme inherently removes unnecessary fluff, decreasing the computational cost, increasing numerical stability, and still yielding nearly identical results.
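
For what it's worth, here is a minimal sketch of that log-space scoring using the same numbers as the question; a real classifier would loop over the entire vocabulary, which is where the numerical care comes in:

```python
import math

p_word_given_terrorism = {
    "kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
    "music": 0.0476, "movie": 0.0952, "TV": 0.0952,
}
counts = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "TV": 1}

log_score = math.log(0.5)              # log of the prior P(Terrorism)
for word, count in counts.items():
    if count > 0:                      # zero-count words drop out, i.e. the power of 0
        log_score += count * math.log(p_word_given_terrorism[word])

print(math.exp(log_score))             # same unnormalised score as multiplying directly
```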


I hope the NSA doesn't think I'm a terrorist for having used the word Terrorism so much in this answer :S

Licensed under: CC-BY-SA with attribution