Question

I am new to clustering and am working on a small project on clustering tweets. I used TF-IDF and then hierarchical clustering. I am confused about how to set the threshold value for hierarchical clustering. What should its value be, and how do I decide on it?
I used the Python scikit module for the implementation.

Solution

While several methods exist to help decide where to stop hierarchical clustering (or clustering in general), there is no single best way to do this. This stems from the fact that there is no "correct" clustering of arbitrary data. Rather, "correctness" is very domain- and application-specific.

So while you can try out different methods (e.g., the elbow method or others), they in turn have their own parameters that you will have to "tune" to obtain a clustering that you deem "correct". This video might help you out a bit (though it mainly deals with k-means, the concepts extend to other clustering approaches): https://www.youtube.com/watch?v=3JPGv0XC6AE
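To make the "tuning" concrete, here is a minimal sketch of one way to compare thresholds. It builds the tree once and then cuts it at several distance thresholds, reporting the number of clusters and the silhouette score for each cut. The toy tweets, the `sweep_thresholds` helper, average linkage, cosine distance, and the threshold grid are all assumptions made for illustration; the silhouette score is just one possible quality measure, not a definitive criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy stand-ins for tweets (hypothetical data, for illustration only).
tweets = [
    "the football match tonight was great",
    "great football match, what a goal",
    "what a goal in the football game",
    "new phone battery lasts all day",
    "my phone battery died again",
    "phone battery life is terrible",
]

# TF-IDF vectors; scipy's linkage() needs a dense array.
X = TfidfVectorizer().fit_transform(tweets).toarray()

# Build the merge tree once (average linkage on cosine distances).
Z = linkage(X, method="average", metric="cosine")


def sweep_thresholds(Z, X, thresholds):
    """Cut the tree at each threshold; return (threshold, n_clusters, silhouette)."""
    results = []
    for t in thresholds:
        labels = fcluster(Z, t=t, criterion="distance")
        k = len(set(labels))
        # The silhouette score is only defined for 2 <= k <= n_samples - 1.
        if 2 <= k <= len(X) - 1:
            score = silhouette_score(X, labels, metric="cosine")
        else:
            score = None
        results.append((float(t), k, score))
    return results


results = sweep_thresholds(Z, X, np.linspace(0.1, 1.0, 10))
for t, k, score in results:
    shown = "n/a" if score is None else f"{score:.3f}"
    print(f"threshold={t:.2f}  clusters={k}  silhouette={shown}")
```

A higher silhouette score suggests tighter, better-separated clusters, but whether that matches what you consider a "correct" grouping of tweets is still a judgment you have to make by inspecting the clusters.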

OTHER TIPS

I assume you are talking about choosing the number of clusters to extract from your hierarchical clustering algorithm. There are several ways of doing this, and there is a nice Wikipedia article covering some of the theory: http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set

For practical examples, take a look at this question: Tutorial for scipy.cluster.hierarchy
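As a rough sketch of what this looks like with `scipy.cluster.hierarchy`, the snippet below shows two ways to turn the tree into flat clusters: asking for a fixed number of clusters, or cutting at a distance threshold. The threshold here is placed in the largest jump between consecutive merge heights, which is only one common heuristic and not a general rule. The synthetic 2-D blobs and Ward linkage are assumptions chosen to keep the example small; real TF-IDF vectors will rarely separate this cleanly.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Three well-separated 2-D blobs as stand-in data (hypothetical).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Each row of Z describes one merge; column 2 holds the merge height.
Z = linkage(X, method="ward")

# Option 1: request a fixed number of flat clusters.
labels_k = fcluster(Z, t=3, criterion="maxclust")

# Option 2: cut at a distance threshold, placed in the middle of the
# largest gap between consecutive merge heights.
heights = Z[:, 2]
gaps = np.diff(heights)
i = int(np.argmax(gaps))
threshold = (heights[i] + heights[i + 1]) / 2.0
labels_t = fcluster(Z, t=threshold, criterion="distance")

print("threshold:", threshold)
print("clusters (maxclust):", len(set(labels_k)))
print("clusters (distance):", len(set(labels_t)))
```

Plotting the tree with `scipy.cluster.hierarchy.dendrogram` and looking at where the long vertical stretches are is the visual version of the same idea.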

Licensed under: CC-BY-SA with attribution