Question

I am working on a project related to spam filtering. Many of you might be aware of this technique used by spammers:

  1. writing "item" as "|tem" (pipe instead of 'i')
  2. writing "sale" as "$ale" (dollar sign instead of 's')
  3. writing "hot" as "h0t" (zero instead of the letter 'o')

etc. etc.

Is there a database of all such possible variants of words written with special symbols? Or does anyone know a good strategy for tackling this trick?

Currently I simply replace '@' with 'a', '|' with 'i', '$' with 's' and so on. I would appreciate your views on this issue. Please help.
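For reference, the replacement approach described above might look roughly like this. It is only a sketch: the name `SYMBOL_MAP`, the function `normalize`, and the exact set of substitutions are assumptions for illustration (only '@', '|', '$' and '0' are taken from the examples in this question).

```python
# Hypothetical substitution table built from the examples in the question;
# extend it with whatever symbols actually appear in your spam corpus.
SYMBOL_MAP = str.maketrans({
    "@": "a",
    "|": "i",
    "$": "s",
    "0": "o",
})


def normalize(text: str) -> str:
    """Lowercase the text and map look-alike symbols back to letters."""
    return text.lower().translate(SYMBOL_MAP)


print(normalize("H0t |tem on $ale"))
```

One weakness of a fixed table like this is that it also rewrites legitimate text, for example turning the digits in "100" into "1oo", so it probably needs to be applied per token with some extra checks.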


Solution

It seems you are taking a message as a starting point and trying to transform it.

Another approach could be to start by defining a list of words that are likely to be altered (sale, viagra, etc.) and then generate all possible similar words. As a measure of similarity you can use the Levenshtein distance.
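A minimal sketch of that idea follows. Note that, instead of generating every variant up front, it compares each incoming token against the word list and keeps the words within a small edit distance, which should catch the same variants without enumerating them. The names `WATCHLIST` and `match_watchlist` and the threshold `max_distance=1` are assumptions chosen for illustration, not tuned values.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical list of words that spammers tend to disguise.
WATCHLIST = ["sale", "viagra", "hot"]


def match_watchlist(token: str, max_distance: int = 1) -> list[str]:
    """Return the watchlist words within max_distance edits of the token."""
    token = token.lower()
    return [w for w in WATCHLIST if levenshtein(token, w) <= max_distance]


for token in ["$ale", "v1agra", "h0t", "hello"]:
    print(token, match_watchlist(token))
```

With short words such as "hot", even a distance of 1 will match ordinary words like "hat" or "not", so the threshold probably has to scale with word length or be combined with the symbol replacement from the question.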

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow