Question

I have a Naive Bayes classifier (implemented with WEKA) that looks for uppercase letters.

contains_A
contains_B
...
contains_Z

For a certain class, the word LCD appears in almost every instance of the training data. When I ask for the probability that "LCD" belongs to that class, I get something like 0.988. Great.

When I ask for the probability of "L" I get a plain 0, and for "LC" I get 0.002. Since the features are assumed to be independent (naive), shouldn't L, C and D each contribute to the overall probability on their own, so that "L" has some probability, "LC" more, and "LCD" even more?
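Roughly, what I expect the naive independence assumption to mean is that each contains_X feature multiplies in its own factor, so the score should grow as more of the letters are present. A toy sketch of that expectation (all the numbers here are made up):

    public class NaiveExpectation {
        public static void main(String[] args) {
            // Hypothetical prior and class-conditional likelihoods P(contains_X = 1 | class)
            double prior = 0.5;
            double pL = 0.9, pC = 0.9, pD = 0.9;

            // With independent features, each present letter multiplies in its own factor
            double scoreL   = prior * pL * (1 - pC) * (1 - pD); // only L present
            double scoreLC  = prior * pL * pC * (1 - pD);       // L and C present
            double scoreLCD = prior * pL * pC * pD;             // L, C and D present

            System.out.printf("L=%.4f LC=%.4f LCD=%.4f%n", scoreL, scoreLC, scoreLCD);
            // I expected this kind of monotonic increase from the classifier as well.
        }
    }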

At the same time, the same experiment with an MLP does not show this behavior: it gives probabilities of 0.006, 0.5 and 0.8 for "L", "LC" and "LCD" respectively.

So the MLP does what I would expect Naive Bayes to do, and vice versa. Am I missing something? Can anyone explain these results?

No correct solution

OTHER TIPS

I am not familiar with the internals of WEKA, so please correct me if you think I am not right.

When a text is used as a "feature", it is transformed into a vector of binary values. Each value corresponds to one concrete word, and the length of the vector is equal to the size of the dictionary.

If your dictionary contains 4 words: LCD, VHS, HELLO, WORLD, then, for example, the text HELLO LCD will be transformed to [1,0,1,0].
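A minimal sketch of that transformation, using the example dictionary and text from above:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class BinaryVector {
        public static void main(String[] args) {
            String[] dictionary = {"LCD", "VHS", "HELLO", "WORLD"};
            Set<String> tokens = new HashSet<>(Arrays.asList("HELLO LCD".split("\\s+")));

            // One binary value per dictionary entry: 1 if the word occurs in the text
            int[] vector = new int[dictionary.length];
            for (int i = 0; i < dictionary.length; i++) {
                vector[i] = tokens.contains(dictionary[i]) ? 1 : 0;
            }
            System.out.println(Arrays.toString(vector)); // [1, 0, 1, 0]
        }
    }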

I do not know how WEKA builds its dictionary, but I think it might go over all the words present in the examples. Unless "L" is present in the dictionary (and therefore appears as a word in the examples), its probability is logically 0. Actually, it should not even be considered a feature.
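To make that concrete, here is a sketch assuming the dictionary is built from the whitespace-separated tokens of the training texts (the example texts are made up): a single letter like "L" never becomes an entry at all.

    import java.util.Set;
    import java.util.TreeSet;

    public class DictionaryFromTokens {
        public static void main(String[] args) {
            // Hypothetical training texts
            String[] trainingTexts = {"SAMSUNG LCD TV", "LCD MONITOR", "HELLO WORLD"};

            Set<String> dictionary = new TreeSet<>();
            for (String text : trainingTexts) {
                for (String token : text.split("\\s+")) {
                    dictionary.add(token);
                }
            }
            System.out.println(dictionary);               // [HELLO, LCD, MONITOR, SAMSUNG, TV, WORLD]
            System.out.println(dictionary.contains("L")); // false: "L" is not a feature at all
        }
    }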

Actually, you cannot reason over the probabilities of the features in this way, and you cannot add them together; I do not think there is such a relationship between the features.

Beware that in text mining, words (letters in your case) may be given weights different from their actual counts if you are using any sort of term weighting and normalization, e.g. tf-idf. With tf-idf, for example, term counts are converted to a logarithmic scale, and terms that appear in nearly every instance are penalized by the idf normalization.

I am not sure which options you are using to convert your data into Weka features, but you can see here that the StringToWordVector filter has parameters for exactly these weighting and normalization options:

http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html

-T: Transform the word frequencies into log(1+fij), where fij is the frequency of word i in the jth document (instance).

-I: Transform each word frequency into fij*log(number of documents / number of documents containing word i), where fij is the frequency of word i in the jth document (instance).
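If you want to experiment with these options programmatically, the usual way to apply the filter looks roughly like this sketch (assuming weka.jar is on the classpath; the ARFF file name and class index are placeholders for your own data):

    import weka.core.Instances;
    import weka.core.Utils;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class TfIdfFilterSketch {
        public static void main(String[] args) throws Exception {
            // "train.arff" is a placeholder: a dataset with a string attribute and a class attribute
            Instances raw = DataSource.read("train.arff");
            raw.setClassIndex(raw.numAttributes() - 1);

            StringToWordVector filter = new StringToWordVector();
            // -T: log(1 + fij) term-frequency transform, -I: IDF transform (see the options above)
            filter.setOptions(Utils.splitOptions("-T -I"));
            filter.setInputFormat(raw);

            Instances vectorized = Filter.useFilter(raw, filter);
            System.out.println(vectorized.numAttributes() + " attributes after filtering");
        }
    }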

I checked the Weka documentation and did not see support for extracting single letters as features. This suggests that Weka needs a space or punctuation to delimit each feature from the adjacent ones. If so, "L", "C" and "D" would each be treated as separate one-letter words that never occur in the training data, which would explain why they were not found.

If you think this is the issue, you could try splitting the text into single characters delimited by \n or a space before ingestion.
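A simple pre-processing sketch along those lines (the helper name is just an illustration), turning each character into its own space-delimited token before handing the text to Weka:

    public class CharSplitter {
        // Insert a space between every character so each letter becomes a separate token
        static String toCharacterTokens(String text) {
            StringBuilder out = new StringBuilder();
            for (char c : text.toCharArray()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(c);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(toCharacterTokens("LCD")); // prints: L C D
        }
    }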

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow