Trying to use the Naive Bayes learner in R, but predict() gives different results than the model would suggest

StackOverflow https://stackoverflow.com/questions/19009259


Question

I'm trying to use the Naive Bayes Learner from e1071 to do spam analysis. This is the code I use to set up the model.

library(e1071)

# Full data set and training subset; both have a 'type' and a 'statement' column
emails <- read.csv("emails.csv")
emailstrain <- read.csv("emailstrain.csv")

# Train a naive Bayes classifier: type is the class, everything else a predictor
model <- naiveBayes(type ~ ., data = emailstrain)

There are two sets of emails, each with a 'statement' and a 'type' column. One is for training and one is for testing. When I run

model

and just read the raw output, it seems to give a higher-than-zero chance of a statement being spam when it is indeed spam, and likewise when it is not. However, when I try to use the model to predict the testing data with

table(predict(model, emails), emails$type)

I get that

       ham  spam
ham   2086   321
spam     2     0

which seems wrong. I also tried testing on the training set itself; in that case it should give quite good results, or at least results as good as what was observed in the model output. However, it gave

       ham  spam
ham   2735   420
spam     0     6

which is only slightly better than with the testing set. I think something must be wrong with how the predict function is working.

Here is how the data files are set up, along with some examples of what's inside:

type,statement
ham,How much did ur hdd casing cost.
ham,Mystery solved! Just opened my email and he's sent me another batch! Isn't he a sweetie
ham,I can't describe how lucky you are that I'm actually awake by noon
spam,This is the 2nd time we have tried to contact u. U have won the £1450 prize to claim just call 09053750005 b4 310303. T&Cs/stop SMS 08718725756. 140ppm
ham,"TODAY is Sorry day.! If ever i was angry with you, if ever i misbehaved or hurt you? plz plz JUST SLAP URSELF Bcoz, Its ur fault, I'm basically GOOD"
ham,Cheers for the card ... Is it that time of year already?
spam,"HOT LIVE FANTASIES call now 08707509020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870..k"
ham,"When people see my msgs, They think Iam addicted to msging... They are wrong, Bcoz They don\'t know that Iam addicted to my sweet Friends..!! BSLVYL"
ham,Ugh hopefully the asus ppl dont randomly do a reformat.
ham,"Haven't seen my facebook, huh? Lol!"
ham,"Mah b, I'll pick it up tomorrow"
ham,Still otside le..u come 2morrow maga..
ham,Do u still have plumbers tape and a wrench we could borrow?
spam,"Dear Voucher Holder, To claim this weeks offer, at you PC please go to http://www.e-tlp.co.uk/reward. Ts&Cs apply."
ham,It vl bcum more difficult..
spam,UR GOING 2 BAHAMAS! CallFREEFONE 08081560665 and speak to a live operator to claim either Bahamas cruise of£2000 CASH 18+only. To opt out txt X to 07786200117

I would really love suggestions. Thank you so much for your help.


Solution

Actually, the predict function works just fine; don't get me wrong, but the problem is in what you are doing. You are building the model using the formula type ~ ., right? It is clear what is on the left-hand side of the formula, so let's look at the right-hand side.

In your data you have only two variables, type and statement, and because type is the dependent variable, the only thing that counts as an independent variable is statement. So far everything is clear.

Let's take a look at the Bayes classifier. The a priori probabilities are obvious, right? What about the conditional probabilities? From the classifier's point of view you have only one categorical variable (your sentences), and to the classifier it is just a list of labels. Almost all of them are unique, so the a posteriori probabilities will be close to the a priori ones.

In other words, the only thing the model can tell us about a new observation is that its probability of being spam equals the overall probability of a message being spam in your training set.
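You can see this in the fitted object itself (a quick sketch, assuming the model was trained as above and that statement is treated as a categorical predictor):

model$apriori                # class counts behind the a priori probabilities
dim(model$tables$statement)  # 2 classes x one column per distinct training sentence

Because each sentence occurs essentially once, every column of that conditional table belongs to a single message, so a new sentence the model has never seen contributes no usable signal and the prediction falls back toward the class priors.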

If you want to apply any machine learning method to natural language, you have to pre-process your data first. Depending on your problem, that could for example mean stemming, lemmatization, computing n-gram statistics, or tf-idf. Training the classifier is the last step; a minimal bag-of-words sketch follows.
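Here is one way such pre-processing could look in base R, turning each message into word-presence features so the classifier sees shared words rather than unique sentences. The tokenization rule, the 100-word vocabulary cutoff, and the w_ column prefix are illustrative choices, not anything from the original post:

library(e1071)
emailstrain <- read.csv("emailstrain.csv", stringsAsFactors = FALSE)

# Tokenize: lower-case, keep letters only, split on whitespace
tokens <- strsplit(gsub("[^a-z]+", " ", tolower(emailstrain$statement)), " ")

# Build a small vocabulary from the most frequent terms (dropping the
# empty string left over from leading separators)
freq  <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- head(setdiff(names(freq), ""), 100)

# One presence/absence factor column per vocabulary word
make_feature <- function(w)
    factor(vapply(tokens, function(t) w %in% t, logical(1)),
           levels = c(FALSE, TRUE), labels = c("No", "Yes"))
features <- as.data.frame(setNames(lapply(vocab, make_feature),
                                   paste0("w_", vocab)))
features$type <- factor(emailstrain$type)

# Words are now shared across messages, so the classifier can generalize;
# laplace = 1 smooths words that never co-occur with one of the classes
model <- naiveBayes(type ~ ., data = features, laplace = 1)

A test set has to be transformed with the same vocabulary before calling predict on it. Packages such as tm or quanteda automate this kind of document-term pre-processing.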

Licensed under: CC-BY-SA with attribution