Question

I want to know whether the k-means clustering algorithm can do classification?

If I have done a simple k-means clustering .

Assume I have many data , I use k-means clusterings, then get 2 clusters A, B. and the centroid calculating method is Euclidean distance.

Cluster A at left side.

Cluster B at right side.

So, if I have one new data. What should I do?

  1. Run k-means clustering algorithm again, and can get which cluster does the new data belong to?

  2. Record the last centroid and use Euclidean distance to calculating to decide the new data belong to?

  3. other method?

Was it helpful?

Solution

The simplest method of course is 2., assign each object to the closest centroid (technically, use sum-of-squares, not Euclidean distance; this is more correct for k-means, and saves you a sqrt computation).

Method 1. is fragile, as k-means may give you a completely different solution; in particular if it didn't fit your data well in the first place (e.g. too high dimensional, clusters of too different size, too many clusters, ...)

However, the following method may be even more reasonable:

3. Train an actual classifier.

Yes, you can use k-means to produce an initial partitioning, then assume that the k-means partitions could be reasonable classes (you really should validate this at some point though), and then continue as you would if the data would have been user-labeled.

I.e. run k-means, train a SVM on the resulting clusters. Then use SVM for classification.

k-NN classification, or even assigning each object to the nearest cluster center (option 1) can be seen as very simple classifiers. The latter is a 1NN classifier, "trained" on the cluster centroids only.

OTHER TIPS

Yes, we can do classification.

I wouldn't say the algorithm itself (like #1) is particularly well-suited to classifying points, as incorporating data to be classified into your training data tends to be frowned upon (unless you have a real-time system, but I think elaborating on this would get a bit far from the point).

To classify a new point, simply calculate the Euclidean distance to each cluster centroid to determine the closest one, then classify it under that cluster.

There are data structures that allows you to more efficiently determine the closest centroid (like a kd-tree), but the above is the basic idea.

If you've already done k-means clustering on your data to get two clusters, then you could use k Nearest Neighbors on the new data point to find out which class it belongs to.

Here another method:

I saw it on "The Elements of Statistical Learning". I'll change the notation a little bit. Let C be the number of classes and K the number of clusters. Now, follow these steps:

  1. Apply K-means clustering to the training data in each class seperately, using K clusters per class.
  2. Assign a class label to each of the C*K clusters.
  3. Classify observation x to the class of the closest cluster.

It seems like a nice approach for classification that reduces data observations by using clusters.

If you are doing real-time analysis where you want to recognize new conditions during use (or adapt to a changing system), then you can choose some radius around the centroids to decide whether a new point starts a new cluster or should be included in an existing one. (That's a common need in monitoring of plant data, for instance, where it may take years after installation before some operating conditions occur.) If real-time monitoring is your case, check RTEFC or RTMAC, which are efficient, simple real-time variants of K-means. RTEFC in particular, which is non-iterative. See http://gregstanleyandassociates.com/whitepapers/BDAC/Clustering/clustering.htm

Yes, you can use that for classification. If you've decided you have collected enough data for all possible cases, you can stop updating the clusters, and just classify new points based on the nearest centroid. As in any real-time method, there will be sensitivity to outliers - e.g., a caused by sensor error or failure when using sensor data. If you create new clusters, outliers could be considered legitimate if one purpose of the clustering is identify faults in the sensors, although that the most useful when you can do some labeling of clusters.

You are confusing the concepts of clustering and classification. When you have labeled data, you already know how the data is clustered according to the labels and there is no point in clustering the data unless if you want to find out how well your features can discriminate the classes.

If you run the k-means algorithm to find the centroid of each class and then use the distances from the centroids to classify a new data point, you in fact implement a form of the linear discriminant analysis algorithm assuming the same multiple-of-identity covariance matrix for all classes.

After k-means Clustering algorithm converges, it can be used for classification, with few labeled exemplars/training data. It is a very common approach when the number of training instances(data) with labels are very limited due to high cost of labeling.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top