Question

I am implementing an email filtering application using the Naive Bayes algorithm. My application uses the Spambase Data Set from the UCI Machine Learning Repository. Since the attributes are continuous, I calculate the likelihoods using a probability density function (PDF). However, when I evaluate the model with k-fold cross-validation, a training set may contain only 0 for one of its attributes. In that case the standard deviation is 0 and the PDF returns NaN, so a large number of spam messages are misclassified with that training set. What should I do to fix the problem?

Solution

You could discretize the attributes and use a discrete distribution (a probability mass function over bins), which will always be bounded.
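As a minimal sketch of the discrete approach, the snippet below bins one continuous attribute and estimates a per-class probability for each bin. The function name `fit_binned_likelihoods`, the cut points in `bin_edges`, and the toy data are all hypothetical. The Laplace smoothing term `alpha` is an addition not mentioned above; it is there so that empty bins do not get probability 0.

```python
import numpy as np

def fit_binned_likelihoods(values, labels, bin_edges, alpha=1.0):
    """Estimate P(bin | class) for one attribute, with Laplace smoothing."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    # np.digitize returns indices 0..len(bin_edges), so there is one more bin than edges
    n_bins = len(bin_edges) + 1
    likelihoods = {}
    for cls in np.unique(labels):
        bins = np.digitize(values[labels == cls], bin_edges)
        counts = np.bincount(bins, minlength=n_bins).astype(float)
        # Smoothed relative frequencies: always finite and strictly positive
        likelihoods[cls.item()] = (counts + alpha) / (counts.sum() + alpha * n_bins)
    return likelihoods

# Toy data (assumed): the attribute is always 0 for class 0, so its variance there is zero
values = [0.0, 0.0, 0.0, 0.0, 0.5, 1.2]
labels = [0, 0, 0, 1, 1, 1]
bin_edges = [0.1, 1.0]  # hypothetical cut points: x < 0.1, 0.1 <= x < 1.0, x >= 1.0

likelihoods = fit_binned_likelihoods(values, labels, bin_edges)
print(likelihoods)
```

Because each likelihood is a smoothed count ratio, a constant attribute produces a valid (if uninformative or highly peaked) distribution instead of NaN. How to choose the bins is left open here.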

Alternatively, simply ignore any attribute with zero variance. There is no point in including a distribution with zero variance, because it carries no information. For example, suppose you want to estimate how old I am, and I tell you that I live on planet Earth. That shouldn't change your estimate, because every single data point you have is for someone on planet Earth.
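Dropping zero-variance attributes might look like the sketch below, which assumes a Gaussian likelihood per attribute and uses NumPy. The names `fit_gaussian_nb` and `predict` and the toy data are hypothetical. Two choices here are assumptions rather than part of the advice above: an attribute is dropped if its variance is zero in any class of the training fold, and the posterior is computed in log space.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Fit per-class means and variances, skipping zero-variance attributes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    # Assumption: keep an attribute only if its variance is positive in every class
    keep = (variances > 0).all(axis=0)
    priors = np.array([(y == c).mean() for c in classes])
    return classes, priors, means[:, keep], variances[:, keep], keep

def predict(model, X):
    """Return the most probable class for each row, using only kept attributes."""
    classes, priors, means, variances, keep = model
    X = np.asarray(X, dtype=float)[:, keep]
    diff = X[:, None, :] - means[None, :, :]
    # Log of the Gaussian density, per sample, class, and kept attribute
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    log_posterior = np.log(priors)[None, :] + log_pdf.sum(axis=2)
    return classes[np.argmax(log_posterior, axis=1)]

# Toy fold (assumed): the first attribute is 0 everywhere, so it is dropped
X_train = [[0.0, 1.0], [0.0, 1.2], [0.0, 0.8],
           [0.0, 3.0], [0.0, 3.3], [0.0, 2.7]]
y_train = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X_train, y_train)
predictions = predict(model, [[0.0, 1.1], [0.0, 2.9]])
print(model[4], predictions)
```

The mask is computed from the training fold only and then reused on the test rows, so both sides of the comparison see the same attributes. Note that an attribute that is constant in one class but varies in another can still be informative; if that matters, adding a small constant to every variance is a common alternative to dropping the attribute.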

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow