Question

I am clustering my data to see what the information looks like and which groups can be identified. Since clustering is unsupervised, I cannot directly test the accuracy of the resulting grouping. So I was wondering what kind of conclusions I can draw after clustering. For example, if I had many emails with no spam/not spam flag or label, how could I use clustering to split them into two groups and test the ‘accuracy’ of the clustering?

To give more context on what I am trying to do: I have several CSV files with fields such as date, user, email subject and email body. I would like to run some analysis but, in order to do this, I need to classify the emails as spam/not spam. I have 23000 emails, so it is very difficult to do this manually. I have already put together a list of common words used to flag spam (ads, buy, offer, porn, promotion, ...) but, since most of the emails do not contain these words in the subject or in the body, this first step assigns the ‘spam’ flag to only around 100 emails. Very low! I have tried topic classification (LDA), but it is not very accurate. I then thought of using k-means clustering to assign these labels, after manually labelling around 300 emails. I do not know if this is the right way to proceed for assigning labels, so comments and answers would be greatly appreciated.

Solution

This is the basic architecture of a spam filter:

[image: basic architecture of a spam filter]

Statistically, spam has lower entropy (i.e., spam messages are more similar to one another) than legitimate email.
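As a rough illustration of the entropy claim, one possible reading is to pool the words of a set of messages and compute the Shannon entropy of the resulting word distribution. The helper name `word_entropy` and the toy messages are made up for this sketch; they do not come from the paper or from real data.

```python
import math
from collections import Counter

def word_entropy(messages):
    """Shannon entropy (in bits) of the pooled word distribution of a list of texts."""
    counts = Counter(word for text in messages for word in text.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy, invented examples: repetitive spam-like texts vs. varied ordinary texts.
spam_like = ["buy cheap offer now", "cheap offer buy now", "offer buy cheap now"]
varied = [
    "meeting moved to friday",
    "please review the attached report",
    "lunch at noon tomorrow",
]

spam_entropy = word_entropy(spam_like)
varied_entropy = word_entropy(varied)
print(spam_entropy, varied_entropy)
```

On these toy inputs the repetitive group scores lower than the varied one, which is the pattern the statement above describes. Whether it holds on a real mailbox has to be checked on the data itself.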

We could use bisecting k-means clustering after doing topic modelling. In plain k-means we have to specify k up front, and the choice of k can change the results drastically and can also lead to empty clusters.
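A minimal sketch of bisecting k-means, using only the standard library, might look like the following. It is not the paper's implementation. As simplifying assumptions, it clusters raw bag-of-words counts instead of topic-model output, and it seeds each 2-means split deterministically with the first point and the point farthest from it. All function names and the toy emails are hypothetical. The `purity` helper is one simple way to sanity-check clusters against a small hand-labelled subset, such as the roughly 300 emails mentioned in the question.

```python
from collections import Counter

def vectorize(texts):
    """Bag-of-words count vectors over a shared, sorted vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in texts:
        v = [0.0] * len(vocab)
        for w in t.lower().split():
            v[index[w]] += 1.0
        vectors.append(v)
    return vectors

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def sse(points):
    """Sum of squared distances to the cluster mean."""
    c = mean(points)
    return sum(dist2(p, c) for p in points)

def two_means(vectors, idx, iters=20):
    """Split the rows in idx into two groups with plain 2-means."""
    c0 = vectors[idx[0]]
    c1 = max((vectors[i] for i in idx), key=lambda p: dist2(p, c0))
    left, right = list(idx), []
    for _ in range(iters):
        left = [i for i in idx if dist2(vectors[i], c0) <= dist2(vectors[i], c1)]
        right = [i for i in idx if dist2(vectors[i], c0) > dist2(vectors[i], c1)]
        if not left or not right:
            break
        c0 = mean([vectors[i] for i in left])
        c1 = mean([vectors[i] for i in right])
    return left, right

def bisecting_kmeans(vectors, k):
    """Repeatedly split the cluster with the largest SSE until k clusters exist."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        worst = max(clusters, key=lambda idx: sse([vectors[i] for i in idx]))
        left, right = two_means(vectors, worst)
        if not left or not right:
            break  # the worst cluster cannot be split any further
        clusters.remove(worst)
        clusters += [left, right]
    return clusters

def purity(clusters, labels):
    """Share of hand-labelled rows that carry their cluster's majority label."""
    hit = total = 0
    for idx in clusters:
        seen = Counter(labels[i] for i in idx if i in labels)
        if seen:
            hit += max(seen.values())
            total += sum(seen.values())
    return hit / total if total else float("nan")

# Toy, invented emails; only rows 0, 3 and 5 are "manually labelled".
emails = [
    "buy cheap offer now", "cheap offer buy today", "offer buy cheap now",
    "meeting moved to friday", "project meeting on friday", "friday meeting agenda",
]
clusters = bisecting_kmeans(vectorize(emails), k=2)
score = purity(clusters, {0: "spam", 3: "ham", 5: "ham"})
print(clusters, score)
```

Because every split divides an existing non-empty cluster in two, this scheme does not leave empty clusters behind. The purity score only says how consistent the clusters are with the labelled sample; it is not a full accuracy estimate for all 23000 emails.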

I would recommend going through this paper, as it highlights this approach.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange