
I have created a neural network for spam detection. It involves the following steps:

1. Formation of a tf-idf matrix of terms and mails.
2. Reduction of the matrix using PCA.
3. Feeding the 20 most important terms, according to the eigenvalues, to the neural network as features.
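
The three steps above can be sketched roughly as follows. This is only a minimal illustration, not the code from the question: the whitespace tokenizer, the smoothed idf formula, the helper names `tfidf_matrix` and `pca_reduce`, and the four sample mails are all assumptions. Note also that PCA yields components (linear combinations of terms) ordered by eigenvalue, so the 20 features here are component scores rather than 20 individual terms.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Return (matrix, vocabulary) with one row per mail and one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(docs)

    # Document frequency: number of mails that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    matrix = np.zeros((n_docs, len(vocab)))
    for row, toks in enumerate(tokenized):
        for tok, count in Counter(toks).items():
            tf = count / len(toks)
            idf = math.log((1 + n_docs) / (1 + df[tok])) + 1.0  # smoothed idf (assumed variant)
            matrix[row, index[tok]] = tf * idf
    return matrix, vocab


def pca_reduce(matrix, n_components):
    """Project rows onto the leading principal components, computed via SVD."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, vt.shape[0])  # cannot exceed the number of mails or terms
    return centered @ vt[:k].T


# Hypothetical sample mails, for illustration only.
mails = [
    "win money now claim your free prize",
    "meeting moved to monday please confirm",
    "free prize waiting claim now",
    "please review the attached quarterly report",
]
tfidf, vocab = tfidf_matrix(mails)
features = pca_reduce(tfidf, n_components=20)
```

With only four sample mails the projection has at most four components; a real corpus would give the full 20. The projected features are only comparable between mails when the same vocabulary and the same components are reused for every batch.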

I'm training it to output 1 for spam and 0 for not spam.

EDITS: I decided to train in batches of 7 mails because forming the matrix was prone to out-of-memory errors. I used the standard Enron dataset of ham and spam. The network is trained via back-propagation and has 1 input, 1 hidden and 1 output layer, with 20 neurons in the input layer and 6 neurons in the hidden layer.
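
For reference, a network of that shape (20 inputs, 6 hidden neurons, 1 output) trained with back-propagation might look like the sketch below. The sigmoid activations, cross-entropy loss, weight initialization, learning rate, and the `TinyNet` class name are assumptions; the question does not state them. The batch of 7 random feature vectors stands in for real mails.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class TinyNet:
    """20 inputs -> 6 hidden units -> 1 output, all sigmoid (assumed)."""

    def __init__(self, n_in=20, n_hidden=6, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.5, size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.5, size=(n_hidden, 1))
        self.b2 = np.zeros(1)

    def forward(self, x):
        self.h = sigmoid(x @ self.w1 + self.b1)
        return sigmoid(self.h @ self.w2 + self.b2)

    def train_batch(self, x, y, lr=0.1):
        """Take one gradient step; return the mean cross-entropy loss before the step."""
        out = self.forward(x)
        y = y.reshape(-1, 1)
        eps = 1e-12
        loss = -np.mean(y * np.log(out + eps) + (1 - y) * np.log(1 - out + eps))
        # With a sigmoid output and cross-entropy loss, the output error is (out - y).
        d_out = (out - y) / len(x)
        d_hidden = (d_out @ self.w2.T) * self.h * (1.0 - self.h)
        self.w2 -= lr * self.h.T @ d_out
        self.b2 -= lr * d_out.sum(axis=0)
        self.w1 -= lr * x.T @ d_hidden
        self.b1 -= lr * d_hidden.sum(axis=0)
        return float(loss)


# Hypothetical batch: 7 mails, 20 features each, with mixed spam (1) and ham (0) labels.
rng = np.random.default_rng(1)
x = rng.normal(size=(7, 20))
y = np.array([1, 0, 1, 0, 1, 0, 0], dtype=float)

net = TinyNet()
losses = [net.train_batch(x, y) for _ in range(200)]
```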

I started training with the original spam mails from my Gmail account, which gave very bad results, before switching to the Enron dataset. Satisfactory outputs were obtained after quite a lot of training.

When I tested, 6 out of 14 mails were detected as spam.

I trained in alternation: batch 1 of spam mails, batch 2 of ham mails, and so on, so that the network is trained to output 1 for spam and 0 for ham.

But now, after a lot more training (almost 400-500 mails, I guess), it is giving bad results again. I reduced the learning rate, but that did not help. What's going wrong?



To summarize my comments into an answer: if your net is producing the results you would expect, and then after additional training the output becomes less accurate, there is a good chance it is overtrained.

This is especially likely to happen if your data set is small or doesn't vary enough. Finding the optimal number of epochs is mostly trial and error.
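
One common way to make that trial and error less manual is early stopping: hold out some mails for validation, record the validation loss after each epoch, and stop once it has not improved for a few epochs. The sketch below is only an illustration of that idea. The function name `best_epoch_with_patience`, the `patience` value, and the loss values are made up for the example.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training counts as stopped once the loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses: improving at first, then degrading.
history = [0.69, 0.52, 0.41, 0.38, 0.40, 0.45, 0.51, 0.60]
best, stopped = best_epoch_with_patience(history, patience=3)
```

Here the loss is lowest at epoch 3 and training would stop at epoch 6, so the weights saved at epoch 3 are the ones to keep.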

Licensed under: CC-BY-SA with attribution