Question

As my first project in data science, I would like to pick out the main clusters in noisy data. I think a good example would be picking out certain kinds of links on a given StackExchange question that has a number of answers. The most common type of link is a link to another question on the SE network. The next most common are tag links and links to user profiles. The remaining links might be random links included in posts, which I consider noise. Ideally, I'm looking for a solution where I don't need to know ahead of time how many clusters of links there will be.

I've implemented my first attempt using scikit-learn and KMeans. However, it's not ideal because I have to specify the number of clusters ahead of time, and I think the random, noisy links get grouped improperly. I also suspect KMeans works better on a larger corpus than on the relatively small set of URL tokens (though that's just a guess).

Is there a way to do this type of clustering, where the number of clusters is unknown or where one of the clusters is a sort of miscellaneous cluster containing objects that don't closely match the other clusters?

Solution

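Have you looked at DBSCAN (Density-Based Spatial Clustering of Applications with Noise)? It is a density-based algorithm that can find clusters of arbitrary shape (unlike k-means), and it labels points that lie in low-density regions as noise, which matches the "miscellaneous" group you describe.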

It doesn't require knowing the number of clusters. However, it does require two parameters that together define density: the neighborhood radius (usually called eps) and the minimum number of points that must fall within that radius for a region to count as dense. You may be able to estimate them for your particular domain.
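
To make the idea concrete, here is a minimal pure-Python sketch of DBSCAN applied to link clustering. Everything specific in it is an assumption made for illustration: the example URLs are made up, `url_tokens` is a hypothetical tokenizer (host plus non-numeric path segments), Jaccard distance is just one plausible choice of metric, and the values `eps=0.6` and `min_samples=3` were picked by hand for this toy data. In practice you would more likely use scikit-learn's `sklearn.cluster.DBSCAN`, which takes `eps` and `min_samples` parameters with the same meaning.

```python
from urllib.parse import urlparse

NOISE = -1  # label given to points that belong to no cluster


def url_tokens(url):
    """Hypothetical tokenizer: host plus the non-numeric path segments."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s and not s.isdigit()]
    return set([parsed.netloc] + segments)


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; 0.0 for identical sets, 1.0 for disjoint sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dbscan(points, eps, min_samples, distance):
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (includes i itself).
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Made-up example links: three question links, three user links, two others.
urls = [
    "https://datascience.stackexchange.com/questions/101/choosing-k-for-kmeans",
    "https://datascience.stackexchange.com/questions/202/dbscan-parameter-tuning",
    "https://datascience.stackexchange.com/questions/303/clustering-short-text",
    "https://datascience.stackexchange.com/users/11/alice",
    "https://datascience.stackexchange.com/users/22/bob",
    "https://datascience.stackexchange.com/users/33/carol",
    "https://example.com/blog/my-post",
    "https://en.wikipedia.org/wiki/DBSCAN",
]

points = [url_tokens(u) for u in urls]
labels = dbscan(points, eps=0.6, min_samples=3, distance=jaccard_distance)

for url, label in zip(urls, labels):
    print(label, url)
```

On this toy data the three question links and the three user links end up in two separate clusters, and the two unrelated links are labeled -1 (noise), without the number of clusters being specified anywhere. With real links the tokenizer and the two parameters would need tuning, since the result is sensitive to all three.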

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange