Question

I am clustering my data to see what the information looks like and which groups can be identified. Since clustering is unsupervised, I cannot directly test the accuracy of the classification, so I was wondering what kind of evaluation I can do after clustering. For example, if I had many emails with no spam/not spam label, how could I use clustering to split them into two groups and test the ‘accuracy’ of the clustering?

To give more context on what I am trying to do: I have several CSV files with fields such as date, user, email subject and email body. I would like to run some analysis, but to do so I need to classify the emails into spam/not spam. I have 23,000 emails, so it is very difficult to do this manually. I have already built a list of words commonly used as spam flags (ads, buy, offer, porn, promotion, ...), but since most emails do not contain these words in the subject or the body, this first step assigns the ‘spam’ flag to only around 100 emails. Very low! I have tried topic classification (LDA), but it is not very accurate. I then thought of using k-means clustering to assign these labels, after manually labelling around 300 emails. I do not know if this is the right way to proceed for assigning labels, so comments and answers would be greatly appreciated.

Solution

This is the basic architecture of a spam filter:

[image: basic architecture of a spam filter]

Statistically, spam has lower entropy (i.e., higher similarity between messages) than legitimate email.
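One way to make the entropy remark concrete is to measure the Shannon entropy of the word distribution over a group of messages. This is only a rough sketch: the helper name `token_entropy`, the whitespace tokenisation and the toy messages are my own assumptions, and the answer does not say which entropy measure is meant.

```python
import math
from collections import Counter


def token_entropy(messages):
    """Shannon entropy (in bits) of the word distribution over all messages."""
    counts = Counter(tok for msg in messages for tok in msg.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up examples: near-duplicate spam versus varied legitimate mail.
spam_like = ["buy now cheap offer", "buy now cheap offer", "cheap offer buy now"]
ham_like = ["meeting moved to friday", "lunch at noon tomorrow",
            "draft report attached here"]

spam_entropy = token_entropy(spam_like)
ham_entropy = token_entropy(ham_like)
print(spam_entropy < ham_entropy)
```

On the toy data the repetitive group scores lower because it reuses the same few words; on real email the gap will be much less clean.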

We could use bisecting k-means clustering after doing topic modelling. In plain k-means we have to specify k, and the choice can change the results drastically and can also lead to empty clusters.
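To illustrate the idea, here is a minimal pure-Python sketch of bisecting k-means: start with one cluster and repeatedly split the cluster with the largest within-cluster squared error into two. The input vectors stand in for per-email topic proportions (for example rows of an LDA document-topic matrix) and are made up. The `purity` check against a few hand labels is my own addition, not part of the approach described above, and all function names are hypothetical.

```python
from collections import Counter


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return tuple(sum(col) / len(points) for col in zip(*points))


def sse(points):
    """Sum of squared distances to the cluster mean."""
    centre = mean(points)
    return sum(sq_dist(p, centre) for p in points)


def two_means(points, n_iter=20):
    """Split points in two with 2-means, seeded with two far-apart points."""
    centre = mean(points)
    c0 = max(points, key=lambda p: sq_dist(p, centre))
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    groups = None
    for _ in range(n_iter):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right or (left, right) == groups:
            break  # keep the last valid split
        groups = (left, right)
        c0, c1 = mean(left), mean(right)
    return groups


def bisecting_kmeans(points, k):
    clusters = [list(points)]
    while len(clusters) < k:
        # Only clusters with at least two distinct points can be split.
        splittable = [c for c in clusters if len(set(c)) > 1]
        if not splittable:
            break
        worst = max(splittable, key=sse)
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters


def purity(clusters, labels):
    """Share of hand-labelled points matching their cluster's majority label."""
    hits = total = 0
    for cluster in clusters:
        known = [labels[p] for p in cluster if p in labels]
        if known:
            hits += Counter(known).most_common(1)[0][1]
            total += len(known)
    return hits / total if total else 0.0


# Made-up topic proportions for seven emails (three topics each).
emails = [
    (0.90, 0.05, 0.05), (0.80, 0.10, 0.10), (0.85, 0.10, 0.05),
    (0.10, 0.60, 0.30), (0.05, 0.50, 0.45), (0.10, 0.30, 0.60),
    (0.20, 0.40, 0.40),
]
# A small hand-labelled subset, standing in for the ~300 labelled emails.
hand_labels = {emails[0]: "spam", emails[1]: "spam",
               emails[3]: "ham", emails[5]: "ham"}

clusters = bisecting_kmeans(emails, k=2)
score = purity(clusters, hand_labels)
print(len(clusters), score)
```

Because every split produces two non-empty groups, this variant cannot end up with empty clusters. Purity only tells you how well the clusters agree with the labelled subset, so it is a sanity check on the clustering and not an accuracy figure for all 23,000 emails.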

I would recommend going through this paper, as it highlights this approach.

Licensed under: CC-BY-SA with attribution