How do I evaluate Clustering?

https://stackoverflow.com/questions/9109525

21-04-2021

Question

I am still researching how to evaluate the clusters produced by clustering (unsupervised learning).

I tried googling, but the measures I found are too theoretical. It would be great if people could share the mechanisms they use to evaluate the clusters they form. Say I have a Java cluster that contains Java EE, Java ME, RMI, JVM, etc., and another cluster, say NoSQL, that contains something like Neo4j, OrientDB, CouchDB, etc. That would be perfect: my clustering algorithm has given me accurate clusters.

However, after training and then testing I may get, say, MySQL and Oracle under the NoSQL cluster, so I do a manual/visual interpretation and then re-train or tweak my algorithm to get better clustering.

Now I want to automate this process of inspecting clusters manually and have a system that gives me the accuracy of the clusters formed. I am looking for something similar to precision, recall, NDCG, MAP, etc. as used in search. My clusters vary in size and there can be n different clusters, so plain precision/recall would not be the right thing.

Solution

I'm working on a clustering project and I have had the same question.

Right now I'm using the JavaML library, which has several built-in clustering algorithms (in my case I'm using K-means) and also several functions to evaluate these algorithms.

The function I'm using to evaluate the 'quality' of my clusters is the sum of squared errors over the elements of each cluster. Put less mathematically, the sum of squared errors adds up the squared distance from every element to the centroid of its own cluster (in the case of K-means), so lower values mean tighter clusters. This is not the perfect, ideal evaluation you are looking for, one that would clearly beat visual comparison (I have the same problem), but at least it is a formal way to measure 'how good your clusters are'. It's cheap, fast and gives you a general view of your clusters.
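To make the sum of squared errors concrete, here is a minimal sketch in plain Python. It does not use JavaML; the function names `centroid` and `sum_of_squared_errors` and the toy 2-D points are made up for illustration, and it assumes each cluster is a list of equal-length numeric vectors.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def sum_of_squared_errors(clusters):
    """Add up the squared Euclidean distance of every point to its own cluster centroid."""
    total = 0.0
    for points in clusters:
        if not points:  # an empty cluster contributes nothing
            continue
        c = centroid(points)
        for p in points:
            total += sum((x - m) ** 2 for x, m in zip(p, c))
    return total


# Hypothetical toy data: the same four points grouped in two different ways.
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (12.0, 10.0)]]
mixed = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (12.0, 10.0)]]

print(sum_of_squared_errors(tight))  # compact clusters give a small value
print(sum_of_squared_errors(mixed))  # scattered clusters give a much larger value
```

Keep in mind that this value tends to shrink as you add more clusters, so it is mainly useful for comparing runs that use the same number of clusters.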

You may also want to look into the 'cluster labeling' problem. It's not trivial, but it tries to tackle that same problem.

I think the right answer to your question depends on the clustering algorithm you are using, and on understanding some of the mathematical theory behind it, because this is not an easy subject :)

Good luck with that!

OTHER TIPS

Normally clustering is used as an unsupervised or semi-supervised learning algorithm. Since you mentioned “However after training and then testing I may get say MySQL, …”, I assume that you are using a semi-supervised clustering algorithm for your application.

You can increase the number of input features (or perhaps run several experiments while increasing the number of input features) and see how the accuracy of your system changes with respect to the size of the feature vector.

Moreover, you can evaluate different clustering algorithms and select the one that gives the best prediction accuracy.
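One possible way to put a number on "accuracy" when comparing runs is cluster purity. This is a swapped-in measure that the answer above does not name, and it assumes you have hand-labelled at least a sample of the items. The sketch below is plain Python; the function name `purity`, the cluster ids and the category labels are all hypothetical.

```python
from collections import Counter


def purity(clusters, true_label):
    """Fraction of labelled items whose category is the majority category of their cluster.

    clusters:   dict mapping a cluster id to the list of items placed in it
    true_label: dict mapping an item to its hand-assigned category (may cover only a sample)
    """
    labelled_total = 0
    majority_total = 0
    for members in clusters.values():
        labels = [true_label[m] for m in members if m in true_label]
        if not labels:  # no labelled items in this cluster
            continue
        labelled_total += len(labels)
        majority_total += Counter(labels).most_common(1)[0][1]
    return majority_total / labelled_total if labelled_total else 0.0


# Hypothetical output of one clustering run, echoing the example from the question.
clusters = {
    "c0": ["Java EE", "JVM", "RMI"],
    "c1": ["Neo4j", "CouchDB", "MySQL", "Oracle"],
}
true_label = {
    "Java EE": "Java", "JVM": "Java", "RMI": "Java",
    "Neo4j": "NoSQL", "CouchDB": "NoSQL",
    "MySQL": "SQL", "Oracle": "SQL",
}

print(purity(clusters, true_label))  # 5 of the 7 labelled items match their cluster's majority
```

Purity does not need clusters of equal size, but it rewards having many small clusters, so it is best used to compare algorithms or feature sets that produce a similar number of clusters.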

Licensed under: CC-BY-SA with attribution
