Question

I've been using clustering in my bag of ML techniques for quite some time now, and I've never found a satisfying answer to this question.

In DBSCAN, we define a maximum radius with which to form clusters. The algorithm will scan the space and group together points that are ALL reachable from one another. However, we can sometimes end up with a non-convex cluster.

My confusion is around how the notion of a "radius", which describes a convex object, can be an input to an algorithm which results in a non-convex object?

Was it helpful?

Solution

A cluster in DBSCAN consists of multiple core points.

The radius is the area covered by a single core point, but together with neighbor core points the shape will be much more complex. In particular, they can be much larger than epsilon, so you should choose a small value, and rely on this "cover" functionality.

Wikipedia has an example of a non-convex cluster

OTHER TIPS

I think it's non-convex because the particular cluster assignment you get when applying DBSCAN depends on the order you traverse the data.

Let's try to illustrate it with an example. Consider this dataset:

enter image description here

You want to run DBSCAN with radius $r = 3$ and $\text{min_pts} = 4$, so you get this:

enter image description here

The point in the center is not a core point because it has only 3 points, not 4, and we have only two core points. And depending on how you traverse the data points, you may get different cluster assignments:

enter image description here

The top picture shows the result we'd get by traversing left-to-right, and the bottom picture - by traversing right-to-left.

Apparently both these results would correspond to the same value of the cost function, thus the cost function has several minima and it is not convex.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange
scroll top