문제

I would like to use the knn distance plot to be able to figure out which eps value should I choose for the DBSCAN algorithm. Based on this page:

The idea is to calculate, the average of the distances of every point to its k nearest neighbors. The value of k will be specified by the user and corresponds to MinPts. Next, these k-distances are plotted in an ascending order. The aim is to determine the “knee”, which corresponds to the optimal eps parameter.

Using python with numpy/sklearn, I have the following points, with the following distance for 6-knn:

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=len(X)).fit(X)
distances, indices = nbrs.kneighbors(X)

# Indices

[[0 1 2 3 4 5]
 [1 0 2 3 4 5]
 [2 1 0 3 4 5]
 [3 4 5 0 1 2]
 [4 3 5 0 1 2]
 [5 4 3 0 1 2]]

# Distances
[[ 0.          1.          2.23606798  2.82842712  3.60555128  5.        ]
[ 0.          1.          1.41421356  3.60555128  4.47213595  5.83095189]
[ 0.          1.41421356  2.23606798  5.          5.83095189  7.21110255]
[ 0.          1.          2.23606798  2.82842712  3.60555128  5.        ]
[ 0.          1.          1.41421356  3.60555128  4.47213595  5.83095189]
[ 0.          1.41421356  2.23606798  5.          5.83095189  7.21110255]]

then I computed the average distance:

distances.mean()
2.9269575028354495

The problem is I don't understand how exactly could I represent the same plot as them with distances in y-axis and number of points according to the distances on the x-axis using python.

Thank for your help.

도움이 되었습니까?

해결책

You

  1. take the last column of that matrix
  2. sort descending
  3. plot index, distance
  4. hope to see a knee (if the distance does not work well. there might be none)
라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 datascience.stackexchange
scroll top