Question

I am doing clustering using mcl. I am trying to "optimize" the clustering with respect to a quality score by tuning the inflation parameter I and a couple of other parameters I introduced.

I have questions with respect to this optimization:

1) Correct me if I am wrong: cross-validation is used when we try to predict the classes of new inputs. Therefore, the concept makes no sense in the context of clustering, where all the inputs are already known and we just try to group them.

2) I am planning to run experiments with different sets of my parameters and then select the ones that give the best results. However, I read about clm close and the possibility of using hierarchical clustering, walking the tree to find the best parameters. I am not familiar with hierarchical clustering; how would this method outperform simply testing different parameter sets?


Solution

As for (1), I would agree. As for (2), that is a very specialist remark, not something to consider when starting a general exploratory (cluster) analysis.

A note about (1), however. If your data is already classified (each node comes with a label), you can treat that classification as a clustering and measure how well the computed clustering matches it, using a criterion such as Variation of Information or the split/join distance. This can be useful when such a classification is available for one particular data set but not for others. It is then worth keeping in mind that consistency matters more than exactness: a data clustering can be a (near) super-clustering or sub-clustering of the classification and in that respect still be consistent with it (see https://stats.stackexchange.com/questions/24961/comparing-clusterings-rand-index-vs-variation-of-information).
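To make the criterion concrete, here is a minimal sketch of Variation of Information between two labelings of the same items, computed directly from the definition VI(A, B) = H(A) + H(B) - 2·I(A, B). The function name and the toy labels are my own illustration, not part of the mcl suite (which provides such comparisons via clm dist):

```python
from collections import Counter
from math import log

def variation_of_information(labels_a, labels_b):
    """VI between two clusterings given as per-item label sequences.

    VI is 0 iff the two partitions are identical (up to relabeling);
    smaller values mean a closer match.
    """
    n = len(labels_a)
    assert n == len(labels_b), "both labelings must cover the same items"
    pa = Counter(labels_a)                   # cluster sizes in A
    pb = Counter(labels_b)                   # cluster sizes in B
    pab = Counter(zip(labels_a, labels_b))   # joint cluster sizes

    h_a = -sum((c / n) * log(c / n) for c in pa.values())   # entropy H(A)
    h_b = -sum((c / n) * log(c / n) for c in pb.values())   # entropy H(B)
    mi = sum((c / n) * log((c / n) / ((pa[a] / n) * (pb[b] / n)))
             for (a, b), c in pab.items())                  # mutual information
    return h_a + h_b - 2 * mi

# Identical partitions with swapped label names still give VI = 0.
print(variation_of_information([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 0.0
```

A clustering that merely merges or splits whole classes of the reference labeling (the "consistent" case above) scores much better than one that cuts across them.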

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow