DBSCAN with Python and scikit-learn: What exactly are the integer labels returned by make_blobs?

https://stackoverflow.com/questions/15819103

01-04-2022

Question

I'm trying to understand the scikit-learn example for the DBSCAN algorithm (http://scikit-learn.org/0.13/auto_examples/cluster/plot_dbscan.html).

I replaced the line

X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4)

with X = my_own_data, so that I can run DBSCAN on my own data.

Now, the variable labels_true, which is the second value returned by make_blobs, is used to compute several evaluation scores for the result, like this:

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f" %
      metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f" %
      metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f" %
      metrics.silhouette_score(D, labels, metric='precomputed'))

How can I calculate labels_true from my data X? What exactly does scikit-learn mean by "label" in this case?

Thanks for your help!

Solution

labels_true is the "true" assignment of points to clusters: the cluster each point should actually belong to. Each label is an integer index identifying one blob. This is available because make_blobs knows which "blob" it generated each point from.
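To see what those labels look like, here is a minimal sketch that calls make_blobs and inspects its second return value. The centers list and random_state=0 are assumptions added for illustration; the question's call does not show its centers or fix a seed.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Assumed centers for illustration; any list of points works.
centers = [[1, 1], [-1, -1], [1, -1]]

# random_state is an assumption, added only to make the run repeatable.
X, labels_true = make_blobs(n_samples=750, centers=centers,
                            cluster_std=0.4, random_state=0)

# One integer per sample: the index into `centers` of the blob
# the sample was generated from.
print(X.shape)                 # (750, 2)
print(labels_true.shape)       # (750,)
print(np.unique(labels_true))  # [0 1 2]
```

So labels_true[i] == 2 means that X[i] was drawn around centers[2]. The integers are only identifiers; their order carries no meaning.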

You can't get that for your own arbitrary data X unless you have some kind of true labels for the points (in which case you wouldn't be doing clustering anyway). The example just shows some measures of how well the clustering performed in an artificial case where you know the true answer.
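Of the scores printed in the question, the silhouette coefficient is the only one that does not take labels_true, so it can still be computed on unlabeled data. The following is a rough sketch under stated assumptions: the array X is a hypothetical stand-in for my_own_data, and the eps and min_samples values are arbitrary choices for this toy data, not recommendations.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for my_own_data: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2)),
])

# eps and min_samples are assumptions chosen for this toy data.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1; leave them out of the score.
mask = labels != -1
n_clusters = len(set(labels[mask]))

# silhouette_score needs at least two clusters, so guard the call.
score = None
if n_clusters >= 2:
    score = metrics.silhouette_score(X[mask], labels[mask])

print("Estimated number of clusters:", n_clusters)
print("Silhouette Coefficient:", score)
```

The other scores (homogeneity, completeness, V-measure, adjusted Rand index, adjusted mutual information) compare the clustering against a reference labeling, so they have to be dropped when no such labeling exists.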

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow