Question

I've been running a dataset through Weka, applying naive Bayes (NB). I'm stuck on the following problem: while analyzing the output, I noticed a difference between the totals shown in the attributes section and the total number of instances reported in the log.

If you sum the [total] row for the "a0" attribute across both classes, Weka reports 1044 instances. If you check "Instances", it says 1036.

The dataset actually contains 1036 instances.

Does anyone have an explanation for this? Thanks.

Here's a log paste:

=== Run information ===

Scheme:       weka.classifiers.bayes.NaiveBayes 
Relation:     teste.carro
Instances:    1036
Attributes:   7
              a0
              a1
              a2
              a3
              a4
              a5
              class
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

Naive Bayes Classifier

               Class
Attribute          0     1
               (0.5) (0.5)
===========================
a0
  1             105.0 175.0
  2             112.0 165.0
  3             153.0 109.0
  4             152.0  73.0
  [total]       522.0 522.0

a1
  1             101.0 165.0
  2             123.0 165.0
  3             136.0 119.0
  4             162.0  73.0
  [total]       522.0 522.0

a2
  1             150.0 107.0
  2             122.0 133.0
  3             121.0 141.0
  4             129.0 141.0
  [total]       522.0 522.0

a3
  1             247.0   1.0
  2             134.0 265.0
  3             140.0 255.0
  [total]       521.0 521.0

a4
  1             189.0 127.0
  2             177.0 185.0
  3             155.0 209.0
  [total]       521.0 521.0

a5
  1             244.0   1.0
  2             160.0 220.0
  3             117.0 300.0
  [total]       521.0 521.0



Time taken to build model: 0 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0.01 seconds

=== Summary ===

Correctly Classified Instances         957               92.3745 %
Incorrectly Classified Instances        79                7.6255 %
Kappa statistic                          0.8475
Mean absolute error                      0.1564
Root mean squared error                  0.2398
Relative absolute error                 31.2731 %
Root relative squared error             47.9651 %
Coverage of cases (0.95 level)         100      %
Mean rel. region size (0.95 level)      80.2124 %
Total Number of Instances             1036     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0,847    0,000    1,000      0,847    0,917      0,858    0,989     0,991     0
                 1,000    0,153    0,868      1,000    0,929      0,858    0,989     0,988     1
Weighted Avg.    0,924    0,076    0,934      0,924    0,923      0,858    0,989     0,989     

=== Confusion Matrix ===

   a   b   <-- classified as
 439  79 |   a = 0
   0 518 |   b = 1

Solution

"Data Mining: Practical Machine Learning Tools and Techniques" by Witten and Frank (the companion book for Weka) points out a problem with naive Bayes.

If a particular attribute value never occurs together with some class value in the training data, its estimated conditional probability is zero, and that zero has undue influence over the class prediction. Weka avoids this by adding one to the count of every value of each categorical attribute when calculating the conditional probabilities (with the denominator adjusted accordingly). You can verify this in your output: each class has 518 instances (see the confusion matrix) and a0 has four values, so each [total] is 518 + 4 = 522, and the two classes together give 1044 = 1036 + 8. The three-valued attributes a3, a4 and a5 show 518 + 3 = 521 for the same reason.

Below I try to explain the undue influence of an attribute value that never occurs with a class.
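The arithmetic can be checked with a short Python sketch. It only reproduces the [total] rows from the numbers in the log, under the assumption that every value count starts at one; it is not Weka's code, and the helper name `smoothed_total` is made up for illustration.

```python
# Per-class instance counts, read from the confusion matrix rows in the log.
class_counts = {"0": 439 + 79, "1": 0 + 518}

# Number of distinct values per attribute, read from the model printout.
num_values = {"a0": 4, "a1": 4, "a2": 4, "a3": 3, "a4": 3, "a5": 3}


def smoothed_total(class_count, k):
    """Hypothetical helper: the [total] row if each of k value counts starts at 1."""
    return class_count + k


totals = {
    attr: {cls: smoothed_total(n, k) for cls, n in class_counts.items()}
    for attr, k in num_values.items()
}

for attr, per_class in totals.items():
    print(attr, per_class, "sum over classes:", sum(per_class.values()))
```

For a0 this gives 522 per class and 1044 over both classes, while the true number of instances, 518 + 518, is still 1036.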

The naive Bayes formula:

P(y|x) = ( P(x1|y) * P(x2|y) * ... * P(xn|y) * P(y) ) / P(x)

From this formula we can see what they mean:

Say:

  • P(x1|y1) = 0
  • P(x2|y1) ... P(xn|y1) all equal 1

From the above formula:

  • P(y1|x) = 0

Even though all other attributes strongly indicate that the instance belongs to class y1, the resulting probability is zero. The adjustment made by Weka allows for the possibility that the instance still comes from the class y1.
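The same effect can be sketched in Python with numbers from the log above. In the printout, a3 = 1 shows 1.0 for class 1, which after subtracting the added one means it never occurred with that class. The function `conditional`, the parameter `alpha`, and the likelihoods in `others` are assumptions made up for this illustration, not Weka's implementation.

```python
from math import prod


def conditional(count, class_total, num_values, alpha=0.0):
    """Estimate P(value | class); alpha=1.0 gives add-one (Laplace) smoothing."""
    return (count + alpha) / (class_total + alpha * num_values)


# a3 = 1 was never seen with class 1: raw count 0 out of 518, three possible values.
raw = conditional(0, 518, 3)
smoothed = conditional(0, 518, 3, alpha=1.0)  # 1 / 521, as in the Weka table

prior = 0.5               # class prior from the log
others = [0.9, 0.9, 0.9]  # hypothetical likelihoods strongly favouring class 1

# Unnormalised numerator of the naive Bayes formula for class 1.
score_raw = prior * raw * prod(others)
score_smoothed = prior * smoothed * prod(others)

print(score_raw, score_smoothed)
```

Without smoothing the score is exactly zero no matter what the other attributes say; with the added one it is small but positive, so class 1 stays possible.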

A true numeric example can be found starting around slide 12 on this webpage

Licensed under: CC-BY-SA with attribution