Question

I have a silly confusion, but it is bothering me a lot. I have to build an ANN for spam detection. So far I have developed modules that build tf-idf vectors of mails and, separately, compute the PCA of that matrix. The problem is that my mails are read directly from the inbox. For training, I was hoping to use my spam box and then reuse the same classes that build the vectors of unread mails. How do I label them as spam?

Should I develop something like this?

   HashMap<HashMap<String,Double>,Integer> trainingSet;

Here the first type argument is the mail vector, dimensionally reduced via PCA, and the Integer is the label (1 for spam, 0 for non-spam). I would then write the vectors to a file and read them back. Or should I instead make my code flexible, so that rather than reading directly from the inbox as it does now, it reads from an existing set of ham and spam available online and models them as mail objects? (I have a MailMessage class that defines members such as subject, body, and mailvector for a mail, which I use to construct the term index and finally the vectors.) I would then form the vectors, build a training set, and after training have the code read my inbox.

Any insight would be appreciated!


Solution

I'm going to be honest with you. Frankly, there are not that many words in the English language. If you use a very large input vector (say, tens of thousands of entries), you will probably get the most effective performance. In fact, you may even be able to get by with no hidden layers.

For image recognition, etc., having an input vector of thousands of points is not uncommon.
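To make the "no hidden layers" idea concrete, here is a minimal sketch of a network with just one sigmoid output unit, trained on labeled vectors. It is an illustration rather than a tuned implementation: the class names `SpamPerceptron` and `LabeledMail` are hypothetical, the features are assumed to be already available as a `double[]` (for example a tf-idf vector), and the labels follow the question's convention of 1 for spam and 0 for non-spam. Pairing each vector with its label in a small class is one possible alternative to using a `HashMap` as a map key.

```java
import java.util.List;

/** Hypothetical sketch: a network with no hidden layer (one sigmoid output unit). */
public class SpamPerceptron {

    /** A feature vector paired with its label: 1 = spam, 0 = non-spam. */
    public static final class LabeledMail {
        final double[] features;
        final int label;

        public LabeledMail(double[] features, int label) {
            this.features = features;
            this.label = label;
        }
    }

    private final double[] weights;
    private double bias;

    public SpamPerceptron(int inputSize) {
        // Java zero-initializes the array, so training starts from all-zero weights.
        this.weights = new double[inputSize];
    }

    /** Output of the single sigmoid unit: the estimated probability of spam. */
    public double predict(double[] x) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Plain stochastic gradient descent on the log loss. */
    public void train(List<LabeledMail> data, int epochs, double learningRate) {
        for (int e = 0; e < epochs; e++) {
            for (LabeledMail m : data) {
                // For a sigmoid unit with log loss, the gradient is (prediction - label) * input.
                double error = predict(m.features) - m.label;
                for (int i = 0; i < weights.length; i++) {
                    weights[i] -= learningRate * error * m.features[i];
                }
                bias -= learningRate * error;
            }
        }
    }
}
```

With this shape, mails from the spam box would become `new LabeledMail(vector, 1)` and known good mails `new LabeledMail(vector, 0)`. After `train(...)`, calling `predict(...)` on the vector of an unread mail gives a value above 0.5 when the model leans towards spam.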

Licensed under: CC-BY-SA with attribution