Why do I get the same test results with trained complementary and standard Naive Bayes model?

StackOverflow https://stackoverflow.com/questions/23676093

Question

I have a question about Mahout: why do I get the same test results (the same model test accuracy of 80% in the confusion matrix) when I test my trained Naive Bayes model with the complementary approach and with the standard approach?

Here are the steps I used:

  1. Converting the documents to vectors:
     # mahout seq2sparse --input /user/root/data-seq/chunk-0 --output /user/root/vectors -ow -wt tfidf -md 2 -x 95 -n 2 -nr 2
  2. Splitting into training and test vectors:
     # mahout split --input data-vectors/tfidf-vectors --trainingOutput training-vectors --testOutput test-vectors --randomSelectionPct 30 --overwrite --sequenceFiles -xm sequential
  3. Training a model:
     a) ComplementaryNaiveBayesClassifier: # mahout trainnb -i training-vectors -el -li labelindex -o model -ow -c
     b) StandardNaiveBayesClassifier: # mahout trainnb -i training-vectors -el -li labelindex -o model -ow
  4. Testing the model (a consolidated sketch of steps 3 and 4 follows this list):
     a) ComplementaryNaiveBayesClassifier: # mahout testnb -i training-vectors -m model -l labelindex -ow -o tweets-testing -c
     b) StandardNaiveBayesClassifier: # mahout testnb -i training-vectors -m model -l labelindex -ow -o tweets-testing
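For reference, here is the same train/test workflow consolidated into one sketch, with separate output directories per variant so the two confusion matrices can be compared side by side. The directory names (labelindex-standard, model-standard, testing-standard, and the -cnb counterparts) are illustrative, not from the question, and testnb is pointed at test-vectors on the assumption that the held-out split from step 2 is what should be evaluated:

   Standard Naive Bayes:
   # mahout trainnb -i training-vectors -el -li labelindex-standard -o model-standard -ow
   # mahout testnb -i test-vectors -m model-standard -l labelindex-standard -ow -o testing-standard

   Complementary Naive Bayes (note the -c flag on both commands):
   # mahout trainnb -i training-vectors -el -li labelindex-cnb -o model-cnb -ow -c
   # mahout testnb -i test-vectors -m model-cnb -l labelindex-cnb -ow -o testing-cnb -c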

Could it be because Standard Naive Bayes does not use weight normalization, while I enabled it in the first step with the -n 2 parameter? If so, does that mean I should omit this parameter when creating the vectors if I want to compare the performance of these two algorithms?

Was it helpful?

Solution

The -n 2 option that you're referring to for mahout seq2sparse actually specifies the L_p norm to use for length normalization [1]. So mahout seq2sparse ... -n 2 applies L_2 length normalization to the TF-IDF vectors. Alternatively, you could use -lnorm for log normalization. This preprocessing step is applied before training for both Complement and Standard Naive Bayes [2].
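As a reminder of what that means (notation mine, not taken from the Mahout docs): L_2 length normalization rescales each TF-IDF vector to unit Euclidean length,

  \hat{v} = \frac{v}{\lVert v \rVert_2}, \qquad \lVert v \rVert_2 = \sqrt{\sum_i v_i^2}

whereas, for example, -n 1 would divide by the L_1 norm \sum_i \lvert v_i \rvert instead.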

Weight normalization is different from length normalization and is not used in Mahout 0.7.
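For contrast, the weight normalization described in Rennie et al.'s Complement Naive Bayes paper (the approach Mahout's implementation is based on) acts on the per-class weight vectors of the trained model rather than on the input document vectors. A rough sketch in my own notation, where w_{ci} is the log-parameter for term i in the complement of class c:

  \hat{w}_{ci} = \frac{w_{ci}}{\sum_k \lvert w_{ck} \rvert}

The exact details are in the paper; this is only meant to show that it is a different operation from length-normalizing the TF-IDF vectors.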

Weight normalization is used in the upcoming 1.0 release, so to get the best comparison of Standard and Complement Naive Bayes you should check out and build a copy of the latest trunk: http://mahout.apache.org/developers/buildingmahout.html.
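If you go that route, the checkout and build is a standard Maven flow; the authoritative steps are on the page above, but it looks roughly like this (the repository URL and Maven options here are my assumption, not quoted from the docs):

   # git clone https://github.com/apache/mahout.git
   # cd mahout
   # mvn -DskipTests clean install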

You should see a significant difference between Standard and Complement Naive Bayes if you upgrade to the latest trunk.

[1] http://mahout.apache.org/users/basics/creating-vectors-from-text.html

[2] http://mahout.apache.org/users/classification/bayesian.html

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow