Question

I am clustering undirected graphs using mcl. To do so, I have to choose a threshold under which nodes are connected, a similarity measure for each edge, and the inflation parameter to tune the granularity of the clustering. I have been playing around with these parameters, but so far the clusters I get seem to be too large (my visualizations suggest that the largest clusters should be cut into two or more clusters). Which other parameters can I play with to improve my clustering? I am currently working with the scheme parameter of mcl to see whether increasing the accuracy would help, but if there are other, more specific parameters that could help to get smaller clusters, for instance, please let me know.

Answer

There are really only two things to consider. The first and most important lies outside mcl (http://micans.org/mcl/) itself, namely how the network is constructed. I've written about it elsewhere, but I'll repeat it here because it is important.

If you have a weighted similarity, choose an edge-weight (similarity) cutoff such that the topology of the network becomes informative; too many or too few edges leave little discriminative information in the presence/absence structure of edges. Choose the cutoff such that no edges connect things you consider very dissimilar, and such that edges connect things you consider somewhat similar to quite similar. In the case of mcl, the dynamic range in edge weight between 'a bit similar' and 'very similar' should, as a rule of thumb, be up to about one order of magnitude, i.e. two-fold or five-fold or ten-fold, as opposed to varying from 0.9 to 1.0. Of course, it is possible to give simple (unweighted) networks to mcl, and it will then just utilise the presence/absence of edges. Make sure the network does not become very dense - a very rough rule of thumb could be to aim for a total number of edges that is in the order of V * sqrt(V) if the number of nodes (vertices) is V, that is, each node has, on average, in the order of sqrt(V) neighbours.
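The checks described above can be sketched in a few lines of Python. This is only an illustration, not part of mcl: the function names threshold_edges and density_report, and the representation of the graph as a list of (node, node, weight) tuples, are assumptions made for the example. It applies a similarity cutoff and then reports the edge count against the rough V * sqrt(V) target, together with the ratio between the largest and smallest remaining edge weight.

```python
import math

def threshold_edges(edges, cutoff):
    """Keep only edges whose similarity is at least `cutoff`.

    `edges` is a list of (node_a, node_b, weight) tuples (hypothetical format).
    """
    return [(u, v, w) for (u, v, w) in edges if w >= cutoff]

def density_report(edges):
    """Summarise how dense the thresholded network is.

    Compares the edge count with the rough V * sqrt(V) target and reports
    the dynamic range of the edge weights (max / min).
    """
    nodes = {n for u, v, _ in edges for n in (u, v)}
    n_nodes = len(nodes)
    n_edges = len(edges)
    weights = [w for _, _, w in edges]
    if weights and min(weights) > 0:
        weight_ratio = max(weights) / min(weights)
    else:
        weight_ratio = float("nan")
    return {
        "nodes": n_nodes,
        "edges": n_edges,
        "target_edges": n_nodes * math.sqrt(n_nodes),  # very rough guide only
        "weight_ratio": weight_ratio,
    }

if __name__ == "__main__":
    raw = [("a", "b", 10.0), ("b", "c", 5.0), ("c", "d", 1.0), ("a", "d", 0.5)]
    kept = threshold_edges(raw, cutoff=1.0)
    print(density_report(kept))
```

If the edge count is far above the target, or the weight ratio is close to 1, that suggests trying a stricter cutoff or a different similarity transformation before touching any mcl option.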

Network construction, as described above, is really crucial, and it is advisable to try different approaches. Now, given a network, there is really only one mcl parameter to vary: the inflation parameter (the -I option). Higher inflation values give finer-grained, i.e. smaller, clusters. A good set of values to test with is 1.4, 2, 3, 4, 6.
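A sweep over these inflation values could be scripted roughly as follows. This is a sketch under assumptions: the input file name graph.abc, the output prefix out.mcl, and the helper names mcl_command and run_sweep are made up for the example, and the input is assumed to be in mcl's label ("abc") format, which is why --abc is passed. By default the function only builds the command lines; set dry_run=False to actually invoke mcl, which must then be installed and on the PATH.

```python
import subprocess

# Inflation values suggested in the answer above.
INFLATION_VALUES = [1.4, 2, 3, 4, 6]

def mcl_command(graph_path, inflation, out_prefix="out.mcl"):
    """Build one mcl command line for a label-format (abc) input file."""
    out_path = f"{out_prefix}.I{inflation}"  # hypothetical naming scheme
    return ["mcl", graph_path, "--abc", "-I", str(inflation), "-o", out_path]

def run_sweep(graph_path, inflations=INFLATION_VALUES, dry_run=True):
    """Return the commands for each inflation value; run them unless dry_run."""
    commands = [mcl_command(graph_path, i) for i in inflations]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises if mcl exits with an error
    return commands

if __name__ == "__main__":
    for cmd in run_sweep("graph.abc"):
        print(" ".join(cmd))
```

Each run writes its clustering to a separate file, so the cluster size distributions can be compared across inflation values afterwards.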

In summary, if you are exploring, try different ways of network construction, using your knowledge of the data to make the network a meaningful representation, and combine this with trying different mcl inflation values.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow