Clustering pair-wise distance dataset
16-10-2019
Problem
I have generated a dataset of pairwise distances as follows:
id_1 id_2 dist_12
id_2 id_3 dist_23
I want to cluster this data to identify patterns in it. I have been looking at spectral clustering and DBSCAN, but I have not been able to decide between them, and I am unsure how to use the existing implementations of these algorithms. So far I have looked at Python and Java implementations.
Could anyone point me to a tutorial or demo on how to use these clustering algorithms for the situation at hand?
Solution
The scikit-learn implementations of spectral clustering and DBSCAN do not require precomputed distances: you can pass the sample coordinates for all of id_1 ... id_n directly. Here is a simplified version of the scikit-learn documentation example that compares clustering algorithms:
import numpy as np
from sklearn import cluster
from sklearn.preprocessing import StandardScaler
## Prepare the data
X = np.random.rand(1500, 2)
# When reading from a file of the form: `id_n coord_x coord_y`
# you will need this call instead:
# X = np.loadtxt('coords.csv', usecols=(1, 2))
X = StandardScaler().fit_transform(X)
## Instantiate the algorithms
spectral = cluster.SpectralClustering(n_clusters=2,
                                      eigen_solver='arpack',
                                      affinity="nearest_neighbors")
dbscan = cluster.DBSCAN(eps=.2)
## Use the algorithms
spectral_labels = spectral.fit_predict(X)
dbscan_labels = dbscan.fit_predict(X)
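If only the pairwise distances from the question are available and the coordinates are not, both estimators can also take precomputed input: DBSCAN accepts a distance matrix with metric="precomputed", and SpectralClustering accepts an affinity (similarity) matrix with affinity="precomputed". The following is a minimal sketch under several assumptions: the ids, the made-up positions used to generate the pair list, eps, min_samples and the bandwidth sigma are all illustrative, every pair is assumed to be present in the list, and the Gaussian kernel is just one common way to turn distances into affinities.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import DBSCAN, SpectralClustering

# Made-up stand-in for the file of `id_a id_b dist_ab` rows:
# two well-separated groups of four ids each (hypothetical data).
positions = {"id_1": 0.0, "id_2": 0.1, "id_3": 0.2, "id_4": 0.3,
             "id_5": 5.0, "id_6": 5.1, "id_7": 5.2, "id_8": 5.3}
pairs = [(a, b, abs(positions[a] - positions[b]))
         for a, b in combinations(positions, 2)]

# Map each id to a row/column index of the matrix.
ids = sorted({name for a, b, _ in pairs for name in (a, b)})
index = {name: k for k, name in enumerate(ids)}

# Fill a symmetric distance matrix; the diagonal stays at zero.
n = len(ids)
D = np.zeros((n, n))
for a, b, dist in pairs:
    D[index[a], index[b]] = dist
    D[index[b], index[a]] = dist

# DBSCAN works on the distance matrix directly.
# eps and min_samples are picked by hand for this toy data.
dbscan = DBSCAN(eps=0.5, min_samples=2, metric="precomputed")
dbscan_labels = dbscan.fit_predict(D)

# Spectral clustering expects similarities, not distances, so the
# distances are converted with a Gaussian kernel (sigma is a free
# bandwidth parameter, chosen by hand here).
sigma = 1.0
A = np.exp(-(D ** 2) / (2.0 * sigma ** 2))
spectral = SpectralClustering(n_clusters=2, affinity="precomputed",
                              random_state=0)
spectral_labels = spectral.fit_predict(A)

for name, d_label, s_label in zip(ids, dbscan_labels, spectral_labels):
    print(name, d_label, s_label)
```

With real data, eps and sigma would need to be tuned to the scale of the distances, and pairs missing from the list would need a sensible fill value before clustering, since a zero left in the matrix would be read as "identical points".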