Question

Say I have items i1, ..., iN

I would like to cluster them in such a way that:

  1. If I ran the cluster many many times the probability that items iJ and iK would end up in the same cluster is high.
  2. The number of clusters and cluster memberships are relatively stable regardless of cluster seeds

Are there well known algorithms to achieve this?

Clarification:

say I want 3 clusters and say:

  • in reality-1 I start with i1, i33, i89 as seeds for cluster c1 c2 c3
  • in reality-2 I start with i44, i55, i77 as seeds for cluster c1 c2 c3

I want the resulting clusters in both realities to be largely similar

Was it helpful?

Solution 2

A often-seen strategy to make an algorithm more robust with respect to initialization, is to bootstrap it. See for instance this paper.

The other option is to sort the data beforehand and use a strictly deterministic algorithm.

OTHER TIPS

I think that hierarchical clustering algorithms will meet your needs.

  1. Cluster consistency is garanteed for the same set, probability that items iJ and iK would end up in the same cluster is 1.
  2. There is no seed. You choose the right number of cluster by analysing the tree, or using existing cut off algorithms (there are a LOT of them).

[EDIT]

In fact any deterministic clustering algorithm has these features, not just hierarchical clustering.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top