Question

Currently, waves of spam are flooding the internet, especially during sporting events.

Since I strongly suspect that the spammers' usernames are computer generated, I thought it might be interesting to try to learn spammer names programmatically.

A username must be between 2 and 15 characters long, begin with a letter, and contain only letters, digits, _ or -.
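For reference, the format rule above could be checked with a regular expression. This is only a sketch: the name `is_valid_username` is made up for illustration, and restricting "letters" to ASCII is an assumption, since the rule does not say.

```python
import re

# First character: a letter. Then 1 to 14 more characters drawn from
# letters, digits, "_" or "-", which gives a total length of 2 to 15.
# Assumption: "letters" means ASCII letters only.
USERNAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]{1,14}")

def is_valid_username(name: str) -> bool:
    """Return True if the whole string satisfies the format rule."""
    return USERNAME_RE.fullmatch(name) is not None
```

Note that this only checks the format. Every name in the sample list passes it, so it says nothing about whether a name is spam.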

A sample list of names:

riazsports0171
maya34444
thelmaeatons
tigran777
newlive100
darbeshbaba
litondina10
nithuhasan
newlive100
bankuali
lldztwydni554
monomala505
nasiruddin1500
lldztwydni554
ariful3032
nazmulhasan

I only have a fairly basic knowledge of algorithms (from university). My question is: which machine learning algorithms and/or string metrics could I use to predict whether an arbitrary username probably belongs to a spammer? I thought about using cosine string similarity, because it's fairly simple.
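To make the cosine idea concrete, here is a rough sketch of what I have in mind. It assumes each name is turned into a vector of character bigram counts; the bigram choice and the function names are my own assumptions, not an established recipe.

```python
import math
from collections import Counter

def char_ngrams(s, n=2):
    """Count the overlapping character n-grams of a lowercased string."""
    s = s.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine_similarity(a, b, n=2):
    """Cosine of the angle between the n-gram count vectors of a and b."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a string shorter than n has no n-grams
    return dot / (norm_a * norm_b)
```

Identical names score 1.0 (up to rounding), and names with no bigram in common score 0.0.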

Solution

Interesting. But I don't think string similarity algorithms are the best solution.

I'd try to extract features from the names and use a classification algorithm. SVMs often give very good results compared to other classification algorithms, but there are alternatives as well (for example Naive Bayes, decision trees, k-NN), each with its own advantages and disadvantages.

The tricky part will be extracting the features, so be creative. Some options: the number of digits, the number of consecutive letters, the number of consecutive consonants, whether capitalization is used, whether it is used correctly, whether the name matches a certain regex, and so on. (You could also use features that don't come from the string, such as the number of messages this user has sent you.)
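As a rough sketch of what feature extraction could look like, the function below turns a name into a few numbers. The feature names, the choice of features, and treating "y" as a consonant are all assumptions for illustration; you would want to experiment with your own.

```python
import re

VOWELS = set("aeiou")

def longest_run(name, predicate):
    """Length of the longest run of consecutive characters satisfying predicate."""
    best = current = 0
    for ch in name:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best

def is_consonant(ch):
    # Assumption: any letter that is not a, e, i, o, u counts (so "y" does too).
    return ch.isalpha() and ch.lower() not in VOWELS

def extract_features(name):
    """Map a username to a dict of simple numeric features."""
    return {
        "length": len(name),
        "num_digits": sum(ch.isdigit() for ch in name),
        "longest_letter_run": longest_run(name, str.isalpha),
        "longest_consonant_run": longest_run(name, is_consonant),
        "num_uppercase": sum(ch.isupper() for ch in name),
        # Example of a regex feature: does the name end in one or more digits?
        "ends_with_digits": int(re.search(r"\d+$", name) is not None),
    }
```

For example, a name like lldztwydni554 from the question gets a long consonant run and a trailing-digits flag, which a pronounceable name without digits would not.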

Next, you need to create a training set. It should contain usernames of both spammers and non-spammers, each manually labeled as one or the other.

Feed the training set to your algorithm of choice. It will produce a classifier that you can use to predict whether new users are spammers.
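To illustrate the train-then-predict flow, here is a minimal k-NN sketch in plain Python (k-NN because it is the shortest to write by hand, not because it is the best choice). Everything here is an assumption for illustration: the three features, the function names, and above all the "ham" names, which are invented placeholders. The spam names are taken from the question. In practice you would probably use a library implementation and a much larger labeled set.

```python
import math
from collections import Counter

def features(name):
    """Hypothetical feature vector: (length, digit count, longest consonant run)."""
    run = best = 0
    for ch in name.lower():
        run = run + 1 if ch.isalpha() and ch not in "aeiou" else 0
        best = max(best, run)
    return (len(name), sum(ch.isdigit() for ch in name), best)

def train(labeled_names):
    """'Training' a k-NN model just stores feature vectors with their labels."""
    return [(features(name), label) for name, label in labeled_names]

def predict(model, name, k=3):
    """Label a name by majority vote among its k nearest training examples."""
    x = features(name)
    nearest = sorted(model, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Spam names come from the question; the "ham" names are made-up placeholders.
training_set = [
    ("riazsports0171", "spam"),
    ("maya34444", "spam"),
    ("tigran777", "spam"),
    ("lldztwydni554", "spam"),
    ("alice", "ham"),
    ("bob_smith", "ham"),
    ("carol-j", "ham"),
    ("dave", "ham"),
]

model = train(training_set)
print(predict(model, "ariful3032"))
```

The features are not scaled here, so the length dominates the distance. With real data you would normalize them first.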

You can evaluate the effectiveness of each algorithm by using cross-validation on your data.
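A k-fold cross-validation loop could be sketched like this. The helper `k_fold_accuracy` and the digit-threshold learner are hypothetical stand-ins, and the "ham" names are invented placeholders; plug in whichever training and prediction functions you actually want to compare.

```python
import random

def k_fold_accuracy(examples, train_fn, predict_fn, k=5, seed=0):
    """Mean accuracy over k folds; every example is held out exactly once."""
    data = list(examples)
    if len(data) < k:
        raise ValueError("need at least k examples")
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(training)
        correct = sum(predict_fn(model, x) == y for x, y in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k

def digits(name):
    return sum(ch.isdigit() for ch in name)

def train_digit_threshold(training):
    """Stand-in learner: pick the digit-count threshold with the fewest training errors."""
    def errors(t):
        return sum((digits(x) >= t) != (y == "spam") for x, y in training)
    return min(range(16), key=errors)

def predict_digit_threshold(threshold, name):
    return "spam" if digits(name) >= threshold else "ham"

# Spam names come from the question; the "ham" names are made-up placeholders.
examples = [
    ("riazsports0171", "spam"), ("maya34444", "spam"), ("tigran777", "spam"),
    ("newlive100", "spam"), ("litondina10", "spam"),
    ("alice", "ham"), ("bob_smith", "ham"), ("carol-j", "ham"),
    ("dave", "ham"), ("erin", "ham"),
]

score = k_fold_accuracy(examples, train_digit_threshold, predict_digit_threshold)
print(score)
```

On a toy set this small and this cleanly separated the score says nothing about real-world performance. The point is only the mechanics: train on k-1 folds, test on the remaining one, and average.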

License: CC BY-SA with attribution
Not affiliated with StackOverflow