Question

When a data set is analyzed by a clustering algorithm in ELKI 0.5, the program produces a number of statistics: the Jaccard index, F1 measure, etc. To calculate these statistics, there have to be two clusterings to compare. What is the clustering created by the algorithm compared to?

Solution

The automatic evaluation (note that you can also configure the evaluation manually!) is based on the labels in your data set. At least in the current version (why are you using 0.5 and not 0.6.0?), it should only evaluate automatically if it finds labels in the data set.
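
To make "compared to the labels" concrete, here is a minimal sketch of how pair-counting measures such as the Jaccard index and F1 can be computed from a clustering result and the labels in the data set. It is written in Python for brevity and is not ELKI's code; the function names and the toy data are assumptions for illustration only.

```python
from itertools import combinations


def pair_counts(found, labels):
    """Count point pairs grouped together by both, only by `found`, or only by `labels`."""
    both = only_found = only_labels = 0
    for i, j in combinations(range(len(found)), 2):
        same_found = found[i] == found[j]
        same_label = labels[i] == labels[j]
        if same_found and same_label:
            both += 1
        elif same_found:
            only_found += 1
        elif same_label:
            only_labels += 1
    return both, only_found, only_labels


def jaccard(found, labels):
    """Pair-counting Jaccard index: agreeing pairs over all pairs grouped by either side."""
    tp, fp, fn = pair_counts(found, labels)
    total = tp + fp + fn
    return tp / total if total else 1.0


def f1(found, labels):
    """Pair-counting F1: harmonic mean of pair precision and pair recall."""
    tp, fp, fn = pair_counts(found, labels)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


# Hypothetical toy data: cluster ids from an algorithm, class labels from the data set.
found = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
print(jaccard(found, labels), f1(found, labels))
```

The cluster ids themselves never need to match the label names; only whether two points end up together matters, which is why the labels can serve as the second "clustering".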

We have not yet published internal evaluation measures. There are some implementations, such as evaluation/clustering/internal/EvaluateSilhouette.java, some of which will be in the next release.

In my experiments, internal evaluation measures were badly misleading. With the Silhouette coefficient, for example, the labeled "solution" would often even score a negative value (i.e. worse than not clustering at all).

Also, these measures are not scalable. The silhouette coefficient takes O(n^2) time to compute, which usually makes this evaluation more expensive than the actual clustering!
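
For illustration, a naive silhouette computation could look like the sketch below. It is plain Python under stated assumptions (Euclidean distance, singleton clusters scored as 0, a hypothetical `silhouette` function), not the ELKI implementation. The nested loop over all pairs of points is where the O(n^2) cost comes from.

```python
import math


def silhouette(points, assignment):
    """Mean silhouette over all points, using Euclidean distance (an assumption)."""
    clusters = {}
    for idx, c in enumerate(assignment):
        clusters.setdefault(c, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, p in enumerate(points):
        own = assignment[i]
        if len(clusters[own]) == 1:
            scores.append(0.0)  # convention assumed here for singleton clusters
            continue
        # Sum distances from point i to every other point, grouped by cluster.
        # This inner loop over all points makes the whole computation O(n^2).
        sums = {c: 0.0 for c in clusters}
        for j, q in enumerate(points):
            if j != i:
                sums[assignment[j]] += math.dist(p, q)
        a = sums[own] / (len(clusters[own]) - 1)  # mean distance within own cluster
        b = min(sums[c] / len(clusters[c]) for c in clusters if c != own)  # nearest other cluster
        scale = max(a, b)
        scores.append((b - a) / scale if scale > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette(points, [0, 0, 1, 1]))  # groups nearby points together
print(silhouette(points, [0, 1, 0, 1]))  # groups distant points together
```

The second assignment groups points that are far apart, which is the kind of situation in which the coefficient turns negative.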

We do appreciate contributions!

You are more than welcome to contribute your favorite evaluation measure to ELKI, to share with others.

Licensed under: CC-BY-SA with attribution