Question

I have come across an extract from an old paper which casually mentions,

If required, we could use KMeans as a method of asserting that this dataset is noisy, thus proving that our classifier is working as well as can be reasonably expected.

I can find no mention of this method after trawling the Internet for solutions. How can this be done? How can this generic KMeans code be adapted to assert that this dataset contains noise?

Code ripped from here

print(__doc__)


# Code source: Gaël Varoquaux
# Modified for Documentation merge by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as pl
from mpl_toolkits.mplot3d import Axes3D


from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(5)

centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data
y = iris.target

estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8),
              'k_means_iris_bad_init': KMeans(n_clusters=3, n_init=1,
                                              init='random')}


fignum = 1
# Fit each estimator and plot its cluster assignments in 3D
for name, est in estimators.items():
    fig = pl.figure(fignum, figsize=(4, 3))
    pl.clf()
    ax = fig.add_axes([0, 0, .95, 1], projection='3d', elev=48, azim=134)

    pl.cla()
    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))

    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    fignum = fignum + 1

# Plot the ground truth
fig = pl.figure(fignum, figsize=(4, 3))
pl.clf()
ax = fig.add_axes([0, 0, .95, 1], projection='3d', elev=48, azim=134)

pl.cla()

for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
pl.show()

Solution

The essence of K-means clustering is dividing a set of multi-dimensional vectors into tightly grouped partitions and then representing each partition (a.k.a. cluster) by a single vector (a.k.a. centroid). Once you have done this, you can compute a goodness-of-fit, i.e. how well the obtained centroids represent the original set of vectors. This goodness-of-fit depends on the number of clusters/centroids chosen, the training algorithm used (e.g. the LBG algorithm), the method used to select the initial centroids, the metric used to compute distances between vectors and, of course, the statistical properties of your data (the multi-dimensional vectors).
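
In scikit-learn, this quantization distortion is exposed as the inertia_ attribute of a fitted KMeans model (the sum of squared distances from each sample to its nearest centroid). A minimal sketch on the same iris data, just to illustrate the idea (this is not taken from the original paper):

from sklearn.cluster import KMeans
from sklearn import datasets

X = datasets.load_iris().data

# Inertia: sum of squared distances from each sample to its closest centroid.
# Lower inertia means the centroids represent the data more tightly.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("quantization distortion (inertia):", km.inertia_)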

After performing clustering, you could use this goodness-of-fit (or quantization distortion) to make some judgments about your data. For example, if two different data sets gave two significantly different goodness-of-fit values (with all other factors, particularly the number of clusters, kept identical), you could say that the set with the worse goodness-of-fit is more "complex", perhaps more "noisy". I put these judgments in quotes because they are subjective (e.g. how do you define noisiness?) and strongly influenced by your training algorithm and the other factors listed above.
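
A sketch of that comparison, where the "noisy" set is simply the iris data with synthetic Gaussian noise added for illustration: fit K-means with the same number of clusters to each data set and compare the resulting distortions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets

X_clean = datasets.load_iris().data
rng = np.random.RandomState(0)
X_noisy = X_clean + rng.normal(scale=0.5, size=X_clean.shape)  # synthetic "noisy" copy

# Same number of clusters and same initialisation settings for both sets,
# so the only thing that differs is the data itself.
for name, data in [('clean', X_clean), ('noisy', X_noisy)]:
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
    print(name, 'distortion:', km.inertia_)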

Another approach would be to train a clustering model on a "clean" data set and then use the same model (i.e. the same centroids) to cluster a new data set. Depending on how much the goodness-of-fit for the new data set differs from that of the original clean training set, you could make some judgment about the "noise" in the new data set.
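
One way to express this in scikit-learn (again only a sketch, with a synthetically perturbed copy of iris standing in for the "new" data set) is to fit KMeans on the clean data and evaluate both sets with score(), which returns the negative K-means objective under the already-fitted centroids:

import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets

X_clean = datasets.load_iris().data                           # stand-in for the "clean" training set
rng = np.random.RandomState(1)
X_new = X_clean + rng.normal(scale=0.5, size=X_clean.shape)   # stand-in for a new, possibly noisy set

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_clean)

# Negate score() to get the distortion of each data set under the *same*
# centroids learned from the clean data; a large jump on the new data is the
# kind of signal described above.
print("distortion on clean data:", -km.score(X_clean))
print("distortion on new data:  ", -km.score(X_new))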

Licensed under: CC-BY-SA with attribution