Question

I am currently working on a problem where we have projects and e-mails, with each e-mail belonging to exactly one project.

My goal is to build a recommendation system that, for each incoming e-mail, suggests the projects it might belong to.

The number of projects is constantly growing, as is the number of e-mails. That is why I decided to use the Nearest Centroid Classifier (NCC): "training" a new class is cheap (it amounts to computing the mean over the e-mails belonging to that class), which seemed promising to me.
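For illustration, the per-class update can be done as a simple running mean, so adding an e-mail to a project never requires revisiting the older ones (a minimal sketch with hypothetical names, assuming dense NumPy vectors):

```python
import numpy as np

def update_centroid(centroid, n_members, new_vector):
    # Running mean: fold one new e-mail vector into the class centroid
    # without touching the project's older e-mails.
    return (centroid * n_members + new_vector) / (n_members + 1)
```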

I use NCC in combination with a bag-of-words representation, weighting the words with TF-IDF scores.
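Roughly, the setup looks like this (a minimal scikit-learn sketch; the toy e-mails and project labels stand in for the real corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the real data: e-mail bodies and project labels.
emails = [
    "invoice for the bridge construction site",
    "steel delivery schedule for the bridge",
    "website redesign mockups attached",
    "new color palette for the website relaunch",
]
projects = ["bridge", "bridge", "website", "website"]

model = make_pipeline(
    TfidfVectorizer(),
    NearestCentroid(),  # Euclidean distance to the per-class TF-IDF means
)
model.fit(emails, projects)
print(model.predict(["updated schedule for steel delivery"]))  # -> ['bridge'] on this toy data
```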

The data pool is not large, which is another reason I opted for a less complex model like NCC: I only have about 5000 useful e-mails spread over roughly 300 projects.

The problem is that when I compute the distances from an e-mail to all project centroids, some centroids win in every case: for nearly every e-mail, the top 10 recommended centroids are the same. When I inspected them, I noticed that these "best" centroids carry very little information; they were built from very little text. And since the incoming e-mail does not have much text either, the residual is small and thus the distance is small.
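One way to check this, assuming the pipeline sketched above, is to pull the fitted classifier out of the pipeline and inspect how much information each centroid carries:

```python
import numpy as np

# Continuing the sketch above: `model` is the fitted pipeline.
ncc = model.named_steps["nearestcentroid"]
norms = np.linalg.norm(ncc.centroids_, axis=1)
nonzero_terms = (ncc.centroids_ != 0).sum(axis=1)

# Projects whose centroids have the smallest norms / fewest terms; if
# they coincide with the ever-winning top 10, the hypothesis holds.
for i in np.argsort(norms)[:10]:
    print(ncc.classes_[i], f"norm={norms[i]:.3f}", f"terms={nonzero_terms[i]}")
```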

Is there any way to deal with this problem? Or are TF-IDF and NCC simply not a good combination?

Solution

This is a standard problem with distance/similarity measures between texts of different lengths. I'm not aware of any standard way to solve it, but in your case I would simply try removing any e-mail shorter than a certain length from the training set (you can experiment with different thresholds). This would hopefully force the centroids to be more specific, the goal being that none of them can easily attract all the instances.
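A minimal sketch of that filtering step, reusing the hypothetical names from the question's pipeline sketch (the threshold value is purely illustrative and should be tuned experimentally):

```python
MIN_TOKENS = 30  # illustrative threshold; try several values

def long_enough(text, min_tokens=MIN_TOKENS):
    # Crude length check on whitespace tokens; a proper tokenizer-based
    # count would work just as well here.
    return len(text.split()) >= min_tokens

# Drop short e-mails before fitting so that no centroid is built
# almost entirely from near-empty documents.
train_pairs = [(e, p) for e, p in zip(emails, projects) if long_enough(e)]
train_emails, train_projects = map(list, zip(*train_pairs))
model.fit(train_emails, train_projects)
```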

Licensed under: CC-BY-SA with attribution