Question

I'm using k-means clustering to group machines by the processes running on them.

Dataset sample:

machine name, process
m1,java
m2,tomcat
m1,word
m3,excel

From this I build a matrix of per-machine process counts:

machine name,java,tomcat,word,excel
m1,1,0,1,0
m2,0,1,0,0
m3,0,0,0,1
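
As a minimal sketch of that step, the matrix can be built from the (machine, process) pairs with the standard library alone. The function name `build_count_matrix` and the choice to sort machines and keep processes in first-seen order are assumptions made for illustration.

```python
from collections import Counter

# (machine, process) pairs copied from the dataset sample above
rows = [("m1", "java"), ("m2", "tomcat"), ("m1", "word"), ("m3", "excel")]

def build_count_matrix(pairs):
    """Return (machines, processes, matrix), where matrix[i][j] counts
    how often processes[j] was observed on machines[i]."""
    counts = Counter(pairs)
    machines = sorted({m for m, _ in pairs})
    # dict.fromkeys drops duplicates while keeping first-seen order
    processes = list(dict.fromkeys(p for _, p in pairs))
    matrix = [[counts[(m, p)] for p in processes] for m in machines]
    return machines, processes, matrix

machines, processes, matrix = build_count_matrix(rows)
for name, vector in zip(machines, matrix):
    print(name, vector)
```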

I then run k-means against this dataset (I have tried both Euclidean and Manhattan distance functions). The dataset is extremely sparse, which I think is why the generated clusters do not make much sense: many machines get grouped into the same cluster because they are very similar.
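
To make the two distance functions concrete, here is a small sketch that computes them for each pair of rows in the sample matrix. The helpers `euclidean` and `manhattan` are written out by hand for illustration and are not taken from any particular clustering library.

```python
import math
from itertools import combinations

# Rows of the sample count matrix (columns: java, tomcat, word, excel)
vectors = {
    "m1": [1, 0, 1, 0],
    "m2": [0, 1, 0, 0],
    "m3": [0, 0, 0, 1],
}

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Map each machine pair to (euclidean distance, manhattan distance)
pairwise = {
    (p, q): (euclidean(vectors[p], vectors[q]), manhattan(vectors[p], vectors[q]))
    for p, q in combinations(sorted(vectors), 2)
}

for pair, (e, m) in pairwise.items():
    print(pair, round(e, 3), m)
```

With binary or near-binary sparse rows, both distances depend only on how many columns differ, so machines that share no processes can still end up nearly equidistant from each other.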

How can I obtain clusters that each contain approximately the same number of points? Or is this not possible because of the sparseness of the data, and should I instead try to cluster on different attributes of the dataset?


Solution

Cluster analysis is not supposed to produce partitions of equal size. It is meant to discover structure in the data.

If the majority of objects is highly similar, then this majority is supposed to be in the majority cluster.

Consider the extreme case where all your data is identical. Any clustering algorithm producing more than one cluster has failed, in my opinion...

So you may be using the wrong class of algorithms for your problem.
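
One rough way to check whether a dominant cluster reflects real similarity is to count how many machines share exactly the same process vector. The sketch below does this on made-up data; the `profiles` dictionary and the name `profile_frequencies` are hypothetical and not taken from the question.

```python
from collections import Counter

# Hypothetical process-count vectors (not from the question's data):
# four machines run only java, one runs tomcat, one runs word and excel.
profiles = {
    "m1": (1, 0, 0, 0),
    "m2": (1, 0, 0, 0),
    "m3": (1, 0, 0, 0),
    "m4": (1, 0, 0, 0),
    "m5": (0, 1, 0, 0),
    "m6": (0, 0, 1, 1),
}

def profile_frequencies(vectors):
    """Count how many machines share each distinct vector, most common first."""
    return Counter(vectors.values()).most_common()

frequencies = profile_frequencies(profiles)
largest_share = frequencies[0][1] / len(profiles)

print(frequencies)
print(f"largest group holds {largest_share:.0%} of machines")
```

If one vector accounts for most machines, a large cluster around it is what a distance-based method should return, and forcing equal-sized groups would split points that are indistinguishable.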

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange