Question

I am developing a spam detection application for Android, and I am using Bayesian classification to detect spam messages. Should I use a fixed training set of 50 ham messages and 50 spam messages, or should I train on each user's own content? What effect will the choice have on the effectiveness of the application? I know this could be a broad discussion, but I would like a precise answer rather than a discussion.

Solution

Fifty messages of each kind is unlikely to be enough; it looks like you'll need thousands of training messages.
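To make the trade-off concrete, here is a minimal sketch of a naive Bayes filter that can start from a bundled training set and keep learning from messages the user labels later. It is written in Python for brevity rather than as Android code, and the class name, method names, and the add-one smoothing are all assumptions for illustration, not a reference implementation.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Hypothetical minimal multinomial naive Bayes filter with add-one smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = {label: 0 for label in self.LABELS}

    @staticmethod
    def tokenize(text):
        # Lowercase and split into simple word-like tokens.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        # Can be called at any time, e.g. whenever the user marks a message,
        # so a bundled corpus and per-user training can be combined.
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        tokens = self.tokenize(text)
        log_scores = {}
        for label in self.LABELS:
            # Smoothed class prior, so an untrained filter returns 0.5.
            prior = (self.message_counts[label] + 1) / (total_messages + 2)
            score = math.log(prior)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(vocabulary) + 1
            for token in tokens:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            log_scores[label] = score
        # Convert log scores to a probability without underflow.
        highest = max(log_scores.values())
        weights = {label: math.exp(s - highest) for label, s in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


# Tiny made-up corpus, only to show the calls; far too small for real use.
demo_filter = NaiveBayesSpamFilter()
demo_filter.train("win free prize now", "spam")
demo_filter.train("free money click now", "spam")
demo_filter.train("lunch at noon tomorrow", "ham")
demo_filter.train("see you at the meeting", "ham")
```

With so few messages, most words in a new message are unseen and the smoothing terms dominate the score, which is one way to see why a 50/50 starter set alone tends to be too small.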

Note that spammers have discovered ways to get past this kind of filter, e.g. misspellings like "v1agra". Iterative refinements to the classifier might catch up to their current techniques.
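One possible refinement against misspellings like "v1agra" is to normalize tokens before they reach the classifier. The sketch below is in Python and rests on assumptions: the substitution table and both function names are made up for illustration, and real spam uses many more tricks than this table covers.

```python
import re

# Assumed, illustrative substitution table for common character swaps.
SUBSTITUTIONS = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)


def normalize_token(token):
    """Map look-alike characters back to letters in word-like tokens."""
    token = token.lower()
    # Leave tokens without any letter (prices, plain numbers) untouched.
    if not any(ch.isalpha() for ch in token):
        return token
    return token.translate(SUBSTITUTIONS)


def tokenize_normalized(text):
    """Split text into tokens and normalize each one."""
    return [normalize_token(t) for t in re.findall(r"[A-Za-z0-9@$]+", text)]
```

The mapping is deliberately crude: it also rewrites legitimate mixed tokens such as "4pm", so it is worth checking against real ham messages before relying on it.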

The Bayesian_spam_filtering article looks like a good place to start, especially its references to in-depth articles.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow