I am new to clustering and doing some minor project on clustering tweets, I used TF-IDF and then hierarchial clustering. I am confused about setting up threshold value for hierarchical clustering. What should be its value and how to decide it?
I used python scikit module for implementation.

有帮助吗?

解决方案

While there are several methods that exist to help terminate hierarchical clustering (or clustering in general) there is no best general way to do this. This stems from the fact that there is no "correct" clustering of arbitrary data. Rather, "correctness" is very domain and application specific.

So while you can try out different methods (e.g., elbow or others) they will in turn have their own parameters that you will have to "tune" to obtain a clustering that you deem "correct". This video might help you out a bit (though it mainly deals with k-means, the concepts extend to other clustering approaches) - https://www.youtube.com/watch?v=3JPGv0XC6AE

其他提示

I assume you are talking about choosing the amount of clusters to extract from your hierarchical clustering algorithm. There are several ways of doing this, and there is a nice Wikipedia article about it for some theory: http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set

For practical examples take a look at this question: Tutorial for scipy.cluster.hierarchy

许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top