DBSCAN in Python: Unexpected result

https://stackoverflow.com/questions/15910899

03-04-2022

Question

I'm trying to understand the DBSCAN implementation by scikit-learn, but I'm having trouble. Here is my data sample:

X = [[0,0],[0,1],[1,1],[1,2],[2,2],[5,0],[5,1],[5,2],[8,0],[10,0]]

Then I calculate D as in the example provided

from scipy.spatial import distance
D = distance.squareform(distance.pdist(X))

D is a matrix holding the distance between each point and all the others, so the diagonal is always 0.

Then I run DBSCAN as:

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=1.1, min_samples=2).fit(D)

eps = 1.1 means, if I understood the documentation correctly, that points at a distance smaller than or equal to 1.1 will be considered part of a cluster (core).

D[1] returns the following:

>>> D[1]
array([  1.        ,   0.        ,   1.        ,   1.41421356,
     2.23606798,   5.09901951,   5.        ,   5.09901951,
     8.06225775,  10.04987562])

which means the second point has a distance of 1 to the first and the third. So I expect them to build a cluster, but ...

>>> db.core_sample_indices_
[]

which means no cores found, right? Here are the other 2 outputs.

>>> db.components_
array([], shape=(0, 10), dtype=float64)
>>> db.labels_
array([-1., -1., -1., -1., -1., -1., -1., -1., -1., -1.])

Why isn't there any cluster?

Solution

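I figure the implementation might just assume your distance matrix is the data itself, so that each row of D is treated as a 10-dimensional point. That would also fit the shape=(0, 10) reported for components_.

Usually you wouldn't compute the full distance matrix for DBSCAN at all, but pass the data and use an index for faster neighbor search.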
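As a minimal sketch of that approach, assuming a reasonably recent scikit-learn, the raw points can be passed to DBSCAN directly. The variable names db_raw and raw_labels are made up for this illustration, and algorithm="kd_tree" is only one of the neighbor-search options scikit-learn offers:

```python
from sklearn.cluster import DBSCAN

# The sample points from the question.
X = [[0, 0], [0, 1], [1, 1], [1, 2], [2, 2],
     [5, 0], [5, 1], [5, 2], [8, 0], [10, 0]]

# Fit on the feature array itself; the k-d tree handles the neighbor
# search, so no full distance matrix has to be built by hand.
db_raw = DBSCAN(eps=1.1, min_samples=2, algorithm="kd_tree").fit(X)

raw_labels = db_raw.labels_.tolist()
print(raw_labels)  # -1 marks noise points
```

With these settings the first five points and the three points at x = 5 should each form a cluster, while [8, 0] and [10, 0] are too far from everything else and stay noise.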

Judging from a one-minute Google search, consider adding metric="precomputed", since the documentation says:

fit(X)

X: Array of distances between samples, or a feature array. The array is treated as a feature array unless the metric is given as ‘precomputed’.
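To see the difference, here is a small sketch that fits the same matrix D twice, once with the default metric and once with metric="precomputed". It assumes a reasonably recent scikit-learn and SciPy; the names as_features, as_distances, feature_labels and distance_labels are only illustrative:

```python
from scipy.spatial import distance
from sklearn.cluster import DBSCAN

# The sample points from the question.
X = [[0, 0], [0, 1], [1, 1], [1, 2], [2, 2],
     [5, 0], [5, 1], [5, 2], [8, 0], [10, 0]]

# Square matrix of pairwise Euclidean distances.
D = distance.squareform(distance.pdist(X))

# Default metric: every row of D is read as a 10-dimensional feature vector.
as_features = DBSCAN(eps=1.1, min_samples=2).fit(D)

# metric="precomputed": D[i, j] is read as the distance between i and j.
as_distances = DBSCAN(eps=1.1, min_samples=2, metric="precomputed").fit(D)

feature_labels = as_features.labels_.tolist()
distance_labels = as_distances.labels_.tolist()
print(feature_labels)   # -1 marks noise points
print(distance_labels)
```

In the first fit the rows of D are all more than eps apart from each other, so every point ends up as noise, which matches the output in the question. In the second fit the neighbors at distance 1 are recognized and clusters appear.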

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow