Question

What is the best way to test a clustering algorithm? I am using an agglomerative clustering algorithm with a stop criterion. How do I test whether the clusters are formed correctly?

Solution

It depends on what you want to test against.

When testing your own implementation of a known algorithm, you might want to compare the results with those of a known-good implementation.
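For example, you could treat scipy's scipy.cluster.hierarchy.linkage as the known-good reference. A minimal sketch, where my_linkage is a hypothetical stand-in for your own implementation (assumed to return a scipy-style linkage matrix):

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

# Known-good reference result
Z_ref = linkage(X, method='single')

# Z_mine = my_linkage(X, method='single')  # your implementation (hypothetical name)
# Merge heights should agree even if tie-breaking changes the merge order:
# np.testing.assert_allclose(np.sort(Z_mine[:, 2]), np.sort(Z_ref[:, 2]))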

Hierarchical clustering is hard to test with respect to quality precisely because it is hierarchical: the common measures such as the Rand index are only defined for strict partitions. You can get a strict partition from a hierarchical clustering, but then you need to fix the height at which to cut.
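As a concrete illustration, here is a sketch using scipy and scikit-learn on made-up two-blob data; the cut height of 5.0 is an arbitrary choice for this data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
# Two well-separated blobs with known labels
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
true_labels = np.repeat([0, 1], 50)

Z = linkage(X, method='average')
# Fixing the cut height turns the hierarchy into a strict partition,
# which can then be scored against the known labels
labels = fcluster(Z, t=5.0, criterion='distance')
print(adjusted_rand_score(true_labels, labels))  # close to 1.0 for a good cut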

OTHER TIPS

A good rule of thumb for evaluating how much a graph can be clustered (on a coarse-grained level) has to do with the "eigenvalue gap". Given a weighted graph A, calculate the eigenvalues and sort them (this is the eigenvalue spectrum). When plotted, if there is a large jump in the spectrum at some point, there is a natural corresponding number of blocks into which to partition the graph.

Below is an example (in Python with numpy) showing that, given an almost block-diagonal matrix, there is a large gap in the eigenvalue spectrum at the number of blocks (parameterized by c in the code). Note that permuting the matrix (equivalent to relabeling your graph nodes) still gives the same spectral gap:

import numpy as np
import matplotlib.pyplot as plt

# Make a block-diagonal matrix of c dense N x N blocks
N = 30
c = 5
A = np.zeros((N * c, N * c))
for m in range(c):
    A[m * N:(m + 1) * N, m * N:(m + 1) * N] = np.random.random((N, N))

# Add some noise
A += np.random.random(A.shape) * 0.1

# Make symmetric
A += A.T - np.diag(A.diagonal())

# Show the original matrix
plt.subplot(131)
plt.imshow(A.copy(), interpolation='nearest')

# Permute the matrix for effect (relabels the graph nodes)
idx = np.random.permutation(N * c)
A = A[idx, :][:, idx]

# Compute eigenvalues (eigvalsh, since A is symmetric)
L = np.linalg.eigvalsh(A)

# Show the permuted matrix and the sorted spectrum
plt.subplot(132)
plt.imshow(A, interpolation='nearest')
plt.subplot(133)
plt.plot(sorted(L, reverse=True))

# Dashed line marks the gap at the number of blocks
plt.plot([c - .5, c - .5], [0, max(L)], 'r--')

plt.ylim(0, max(L))
plt.xlim(0, 20)
plt.show()

[Figure: the three plots produced by the code above: the original block matrix, the permuted matrix, and the sorted eigenvalue spectrum with a visible gap at c = 5]

Ideally you have some pre-clustered data (as in supervised learning) and can test the results of your clustering algorithm against it. Count the number of correctly assigned points and divide by the total number of points to get an accuracy score; since cluster labels are arbitrary, you first need to match each cluster to the true class it best corresponds to.
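A sketch of that matching step, assuming you have ground-truth labels (the helper clustering_accuracy is a made-up name for illustration; the matching uses scipy's Hungarian-algorithm solver):

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    # Contingency table: rows = true classes, columns = clusters
    table = np.zeros((len(classes), len(clusters)), dtype=int)
    for i, c in enumerate(classes):
        for j, k in enumerate(clusters):
            table[i, j] = np.sum((true_labels == c) & (cluster_labels == k))
    # Choose the one-to-one cluster-to-class mapping that maximizes agreement
    rows, cols = linear_sum_assignment(-table)
    return table[rows, cols].sum() / len(true_labels)

# Cluster IDs are swapped relative to the true labels, yet accuracy is 1.0
print(clustering_accuracy(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])))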

If you are doing unsupervised learning, then there is really no way to evaluate your algorithm.

It is sometimes useful to construct input data with a known, and perhaps obvious, answer built in by construction. For a clustering algorithm, you might construct data with N clusters such that the maximum distance between any two points in the same cluster is smaller than the minimum distance between any two points in different clusters. Another option is to generate a number of data sets that can be plotted as 2-d scatter diagrams with clusters obvious to the eye, then compare the result from your algorithm with this structure, perhaps moving the clusters closer together to see at what point the algorithm fails to find them.
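Here is a sketch of the first construction (the separation margin, cluster count, and sizes are arbitrary choices):

import numpy as np
from scipy.spatial.distance import pdist, cdist

rng = np.random.default_rng(2)
centers = np.array([[0, 0], [100, 0], [0, 100]])
X = np.vstack([c + rng.uniform(-1, 1, (30, 2)) for c in centers])
labels = np.repeat([0, 1, 2], 30)

# Verify the construction satisfies the separation property:
# every within-cluster distance is below every between-cluster distance
max_within = max(pdist(X[labels == k]).max() for k in range(3))
min_between = min(cdist(X[labels == a], X[labels == b]).min()
                  for a in range(3) for b in range(a + 1, 3))
assert max_within < min_between

# Any reasonable algorithm should recover `labels` from X exactly
# (compare with a permutation-invariant score such as the Rand index).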

You might be able to do better given knowledge of your particular clustering algorithm, but the above might at least have some chance of flushing obvious bugs from cover.
