Differentiate between positive and negative clusters

https://datascience.stackexchange.com/questions/85777

16-12-2020

Question

I have applied k-means clustering to my dataset of Amazon Alexa reviews:

model = KMeans(n_clusters=2, max_iter=1000, random_state=True, n_init=50).fit(X=word_vectors.vectors.astype('double'))

Now I want to check which cluster is positive and which is negative. Can anyone suggest a way to do that?

Also, is there a way to check which cluster a particular word belongs to? For example, does the word 'bad' belong to cluster 0 or cluster 1?

Solution

You may not have a positive and a negative cluster at all. Your inputs are word vectors, and unless those vectors were trained with explicit positive and negative labels, it is very unlikely that your KMeans learned that distinction.

If you used pre-trained word vectors, your KMeans could have picked up an arbitrary difference between cluster 0 and cluster 1. Maybe it separated reviews written by men from reviews written by women, or reviews that contain the word "parachute" from those that don't. The options are endless.

What you can do is inspect the labels your KMeans assigned (model.labels_, which has one entry per row of X, in the same order) and filter your input X by cluster. Then count the occurrences of each word in each cluster and sort the words by how often they appear. This might help you understand the difference between cluster 0 and cluster 1.

Note: if the top words you get are words like "a", "the", "of", "if", etc., use a stop-word list, or filter those words out with a maximum document frequency threshold.
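The inspection described above could look roughly like the sketch below. It is only an illustration: the helper names (top_words_per_cluster, cluster_of) and the toy vocab, labels and counts are made up so that the snippet runs on its own. With the setup from the question you would presumably use labels = model.labels_, and, assuming word_vectors is a gensim 4.x KeyedVectors object, vocab = word_vectors.index_to_key.

```python
from collections import defaultdict


def top_words_per_cluster(vocab, labels, counts=None, stop_words=frozenset(), top_n=10):
    """Group words by cluster label and keep the most frequent ones per cluster.

    vocab  : list of words, in the same order as the rows passed to KMeans
    labels : cluster label for each row (e.g. model.labels_)
    counts : optional dict word -> corpus frequency, used only for sorting
    """
    clusters = defaultdict(list)
    for word, label in zip(vocab, labels):
        if word not in stop_words:
            clusters[int(label)].append(word)

    counts = counts or {}
    return {
        label: sorted(words, key=lambda w: counts.get(w, 0), reverse=True)[:top_n]
        for label, words in clusters.items()
    }


def cluster_of(word, vocab, labels):
    """Return the cluster label of a single word seen during fitting."""
    return int(labels[vocab.index(word)])


# Toy stand-ins (hypothetical data, not from the Alexa reviews).
vocab = ["the", "good", "love", "great", "bad", "broken", "useless"]
labels = [0, 0, 0, 0, 1, 1, 1]
counts = {"the": 900, "good": 40, "love": 55, "great": 70,
          "bad": 30, "broken": 12, "useless": 8}

top = top_words_per_cluster(vocab, labels, counts, stop_words={"the"}, top_n=2)
print(top)                                # {0: ['great', 'love'], 1: ['bad', 'broken']}
print(cluster_of("bad", vocab, labels))   # 1
```

For a word vector that was not part of the fitted data, model.predict(vector.reshape(1, -1).astype('double')) should return the nearest cluster instead of a lookup in model.labels_.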

Other tips

NLTK has a sentiment module (nltk.sentiment). You can use it to score each text in the clusters and compare the share of positive versus negative texts per cluster.
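One possible way to do this, as a rough sketch: average a sentiment score per cluster and see whether the two averages differ. The helper names and the toy lexicon below are invented for illustration. make_vader_scorer assumes NLTK is installed and that the VADER lexicon has been fetched once with nltk.download("vader_lexicon"); it is defined but not called here, so the demo runs without NLTK.

```python
from collections import defaultdict
from statistics import mean


def make_vader_scorer():
    """Return a function text -> VADER compound score in [-1, 1].

    Assumes NLTK is installed and nltk.download("vader_lexicon") was run once.
    """
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]


def mean_sentiment_per_cluster(items, labels, score):
    """Average the sentiment score of the items (words or texts) in each cluster."""
    scores = defaultdict(list)
    for item, label in zip(items, labels):
        scores[int(label)].append(score(item))
    return {label: mean(values) for label, values in scores.items()}


# Toy scorer and data (hypothetical) so the example runs without NLTK.
toy_lexicon = {"great": 1.0, "love": 1.0, "bad": -1.0, "broken": -1.0}


def toy_score(word):
    return toy_lexicon.get(word, 0.0)


items = ["great", "love", "the", "bad", "broken"]
labels = [0, 0, 0, 1, 1]

result = mean_sentiment_per_cluster(items, labels, toy_score)
print(result)
```

With NLTK available, passing make_vader_scorer() instead of toy_score gives real scores. If the two cluster averages come out close to each other, the clusters most likely do not separate sentiment at all.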

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange