Question

I was having a look at a basic example of spam filtering using a logistic regression algorithm and an answer posted on this Stack Overflow question: https://stackoverflow.com/questions/29291263/mllib-classification-example-stops-in-stage-1

Correct me if I'm wrong, but these algorithms don't appear to use any sort of tokenization. For example, if my spam word is hey then the word heyyyyy might pass as a false negative through the filter.

Is there an algorithm or process that can be added to basic logistic regression to improve this? Or do I need to look heavily at LDA and Topic Modelling? Or will n-gram tokenization of characters work?

Update

Although I suspected it already, I ran my own test to be sure. LogisticRegressionWithSGD does not perform any sort of tokenization (nor does the example). The example in the link fails on moneyyyyy, although money does trigger a spam prediction.


Solution

So after looking deeply into this issue for a few hours, I've been able to break it down into a few different approaches and put together an intermediate solution that I think will get me to a fix for my own use case.

Neural Nets

Though not a lot of easily accessible information is available, neural nets trained on a ton of data (with proper features) are the best option for building filters that can also keep learning from new trends that should be reflected in the classifier. A great example is Gmail's spam filter, which is also capable of learning each user's personal preferences.

Hidden Markov Models

HMMs are capable of looking past intentional misspellings (and can be localized more easily), and they can classify correctly even when a sample is deliberately trying to fool the classifier. Unfortunately, there aren't many HMM examples readily available to demonstrate the concept, though this paper about Dynamically Weighted HMMs describes the idea rather well.
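For a rough idea of the machinery involved, here is a minimal, self-contained sketch of the forward algorithm for a character-level HMM. This is a toy of my own, not the Dynamically Weighted HMM from the paper; the state count, transition probabilities and emission probabilities are hand-set purely for illustration, where a real system would learn them (e.g., with Baum-Welch) over many more states:

// Toy character-level HMM: score a word under a "spam" model and a "ham" model
// and pick the more likely one. All probabilities below are made up for illustration.
case class Hmm(initial: Array[Double],
               transition: Array[Array[Double]],
               emission: Array[Map[Char, Double]]) {
  private val states = initial.length
  // Small probability floor for characters a state has never "seen".
  private def emit(s: Int, c: Char): Double = emission(s).getOrElse(c, 1e-6)

  // Standard forward algorithm: log P(word | model).
  def logLikelihood(word: String): Double = {
    var alpha = Array.tabulate(states)(s => initial(s) * emit(s, word.head))
    for (c <- word.tail) {
      val prev = alpha
      alpha = Array.tabulate(states) { s2 =>
        (0 until states).map(s1 => prev(s1) * transition(s1)(s2)).sum * emit(s2, c)
      }
    }
    math.log(alpha.sum)
  }
}

val spamModel = Hmm(
  initial = Array(0.6, 0.4),
  transition = Array(Array(0.7, 0.3), Array(0.4, 0.6)),
  emission = Array(
    Map('m' -> 0.2, 'o' -> 0.2, 'n' -> 0.2, 'e' -> 0.2, 'y' -> 0.2),
    Map('y' -> 0.7, 'e' -> 0.15, 'o' -> 0.15)   // a state that tolerates repeated letters
  ))
val hamModel = spamModel.copy(emission = Array(
  Map('h' -> 0.25, 'i' -> 0.25, 'd' -> 0.25, 'a' -> 0.25),
  Map('a' -> 0.5, 'd' -> 0.5)))

val word = "moneyyyyy"
println(s"spam? ${spamModel.logLikelihood(word) > hamModel.logLikelihood(word)}")  // true

Because the second spam state loops while emitting repeated characters, elongations like the extra y's barely hurt the spam model's likelihood, which is the intuition behind HMMs coping with intentional misspellings.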

N-gram character tokenization

N-grams involve "slicing" words up into smaller parts, which is particularly useful in natural language processing. Character-level tokenization adds the benefit of looking past most misspellings and slang to recover (in a rough sense) the "original" word. Tokenization can take place during both training and classification. To answer the question directly: it should be applied consistently to both the training dataset and the samples being classified in the logistic regression pipeline.
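As a quick, standalone illustration (plain Scala, nothing Spark-specific), character trigrams make money and moneyyyyy look almost identical to the classifier, because every trigram of money also appears in moneyyyyy:

// Overlapping character trigrams of a word.
val trigrams = (word: String) => word.sliding(3, 1).toSet

val clean  = trigrams("money")      // Set(mon, one, ney)
val spammy = trigrams("moneyyyyy")  // Set(mon, one, ney, eyy, yyy)
println(clean.intersect(spammy))    // Set(mon, one, ney) -- full overlap with "money"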

and...

Intermediate solution

Given the constraints of time and the available examples, I was able to quickly rough out some basic tokenization and produce an up-to-date example based on my original question. I came up with the following code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint


val conf = new SparkConf().setAppName("TokenizationTest").setMaster("local")
val sc = new SparkContext(conf)
// Load 2 types of emails from text files: spam and ham (non-spam).
// Each line has text from one email.
val spam = sc.textFile("testdata/spam.txt")
val ham = sc.textFile("testdata/ham.txt")
println("Loaded sample data")
// Create a HashingTF instance to map email text to vectors of 5 features.
val tf = new HashingTF(numFeatures = 5)
// Each email is split into words, each word into overlapping character trigrams,
// and each trigram is mapped to one feature.
val spamFeatures = spam.map(email => tf.transform(email.split(" ").flatMap(_.sliding(3, 1))))
val hamFeatures = ham.map(email => tf.transform(email.split(" ").flatMap(_.sliding(3, 1))))
// Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples ++ negativeExamples
trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.
// Create a Logistic Regression learner which uses the SGD optimizer.
val lrLearner = new LogisticRegressionWithSGD()
// Run the actual learning algorithm on the training data.
val model = lrLearner.run(trainingData)
// Test on a positive example (spam) and a negative one (ham).
// First apply the same tokenization and HashingTF feature transformation used on the training data.
val posTestExample = tf.transform("get moneyyyyy ...".split(" ").flatMap(_.sliding(3, 1)))
val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" ").flatMap(_.sliding(3, 1)))
// Now use the learned model to predict spam/ham for new emails.
println(s"Prediction for positive test example: ${model.predict(posTestExample)}")
println(s"Prediction for negative test example: ${model.predict(negTestExample)}")
sc.stop()

The key difference is that I added character n-gram tokenization to the training data (and applied the same transformation to the test examples):

val spamFeatures = spam.map(email => tf.transform(email.split(" ").flatMap(_.sliding(3, 1))))
val hamFeatures = ham.map(email => tf.transform(email.split(" ").flatMap(_.sliding(3, 1))))

In Scala, sliding(n, 1) slides a window of n characters along a string one position at a time, producing overlapping character n-grams (trigrams here). This could be improved further by combining different token sizes, though care must be taken not to generate short, low-information "stop" tokens like at. There are probably many drawbacks to this method - however, I imagine I'm on the right path.
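As a sketch of that idea (the sizes 2 and 3 and the length cutoff are arbitrary choices of mine, not tuned values), a tokenizer combining two n-gram sizes and skipping very short words could look like this, reusing the tf and spam values from the code above:

// Combine bigrams and trigrams per word, skipping words too short to be useful.
def tokenize(email: String): Seq[String] =
  email.split(" ").toSeq
    .filter(_.length > 2)                            // drop short "stop"-like words such as "at"
    .flatMap(w => w.sliding(2, 1) ++ w.sliding(3, 1))

val spamFeatures = spam.map(email => tf.transform(tokenize(email)))
val posTestExample = tf.transform(tokenize("get moneyyyyy ..."))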

Although neural nets are rather unattainable with my dataset, HMMs are probably going to be the better solution.

Licensed under: CC-BY-SA with attribution