Question

I need to perform dimensionality reduction on a multi-dimensional data set that has been clustered using k-means. The data contains positive and negative real numbers obtained from sensors placed on a haptic glove, captured while performing an action, say drawing the letter "A":

    0.1373   -1.8764
   -1.7020   -0.8322
    0.4862    0.8276
   -0.0078    1.3597
    0.9008    1.8043
    2.9751    0.7125
   -0.3257    0.1754

Now, my questions are:

  1. I do not get any clustering for multi-dimensional data using the following code:

K = 3;
load('b2.txt');
data = b2;
numObservations = size(data, 1);   % number of observations (rows)
%% cluster
opts = statset('MaxIter', 500, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
    'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 50, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 200, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')

What is wrong, and how can I rectify it?

  2. After obtaining the clusters across all dimensions, I now represent the data by its cluster labels, e.g.

    1 1 3 2

and so on.

  • Does this data incorporate the temporal ordering of the events? At a glance it seems to, but there are papers stating that clustering does not take temporal ordering into account.
  • I need to reduce its length. I am aware of principal component analysis, but that selects dimensions and does not reduce the data length. Is it reasonable to use this reduced format for distance-based classification of an incoming test data set?

Solution

The code you provided works well, with a slight modification, for the 2-D data set (two features) you posted.

Try it as follows:

data=[    0.1373   -1.8764
         -1.7020   -0.8322
          0.4862    0.8276
         -0.0078    1.3597
          0.9008    1.8043
          2.9751    0.7125
         -0.3257    0.1754];

numObservations = size(data, 1);   % number of observations (rows)
K = 3;

%% cluster

%opts = statset('MaxIter', 500, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = ...
     kmeans(data, K, 'MaxIter', 500, 'Display', 'iter', ...
            'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);

%% plot data+clusters

figure, hold on
scatter(data(:,1),data(:,2), 50, clustIDX, 'filled')
scatter(clusters(:,1),clusters(:,2), 200, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y')

This is the result:

[Figure: the seven data points and the three centroids in the 2-D feature space, colored by cluster]

Once again, the dataset you provided contains 2 features, so it is essentially 2D.

As far as I understand, kmeans only clusters the data; it does not by itself perform dimensionality reduction (I invite anyone reading this to correct me). For dimensionality reduction what you really want is PCA or something similar. After PCA you can project your data onto the principal-component axes and display the clusters in a lower-dimensional space.
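The code above is MATLAB, but the PCA idea is language-neutral; here is a minimal sketch in Python/NumPy using the seven observations from the question. Note in particular that projecting onto the first principal component shrinks the number of features, not the number of observations, which is exactly the distinction the question raises:

```python
import numpy as np

# The seven two-feature observations from the question.
data = np.array([
    [ 0.1373, -1.8764],
    [-1.7020, -0.8322],
    [ 0.4862,  0.8276],
    [-0.0078,  1.3597],
    [ 0.9008,  1.8043],
    [ 2.9751,  0.7125],
    [-0.3257,  0.1754],
])

# PCA via SVD of the mean-centered data: the rows of Vt are the
# principal-component directions, ordered by explained variance.
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Project onto the first principal component: each observation becomes
# a single number, but the *count* of observations is unchanged --
# PCA reduces dimensions (columns), not data length (rows).
projected = centered @ Vt[0]
print(projected.shape)  # (7,)
```

In MATLAB the equivalent projection is available from `pca` (or `princomp` in older releases), whose score output is exactly this projection.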

I don't actually understand what you mean by temporal ordering, but if there is some correlation between temporal events and the features, you can expect kmeans to classify (indirectly) according to those events.
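To make that point concrete, here is a small Python/NumPy sketch with synthetic data (a hand-rolled Lloyd iteration standing in for MATLAB's kmeans): a hypothetical gesture drifts through three phases over time, so the clusters line up with the temporal phases even though the algorithm never sees the time index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensor stream: the gesture passes through three phases
# in time, so early/middle/late samples occupy different regions of
# the 2-D feature space.
phase_means = np.array([[-3.0, 0.0], [0.0, 3.0], [3.0, 0.0]])
X = np.vstack([m + 0.3 * rng.standard_normal((50, 2)) for m in phase_means])

# Minimal Lloyd's k-means; note it never sees the time index.
# (Initializing from one sample per phase keeps the demo simple.)
K = 3
centroids = X[[0, 50, 100]].copy()
for _ in range(50):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

# Each temporal block of 50 samples ends up with one constant label:
# the clustering indirectly recovers the temporal phases.
print([len(set(labels[i * 50:(i + 1) * 50])) for i in range(3)])
```

This only works because the features themselves change over time; if the feature distribution were stationary, the cluster labels would carry no temporal information at all.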


Here's another example, again with K = 3 clusters. The centroids of the clusters are returned by kmeans in the clusters output variable above.

[Figure: feature-space points colored by time (left), the same points colored by cluster assignment (middle), and the cluster centroids on the same color scale (right)]

The plot on the left shows the points in the 2-D feature space colored according to time (the colorbar shows how relative time maps to color). The middle panel shows the cluster each point was assigned to, on a new color scale; the right panel uses that same scale to show the positions of the centroids. The point of the figure is to display the temporal regularity with which features show up.

With regard to your question about temporal ordering, it would appear that kmeans can uncover implicit temporal correlations in the features (if that is what you mean), as shown in the following plot of clustIDX versus time:

[Figure: cluster index (clustIDX) plotted against sample index, i.e. time]

But I do not know how this compares to other processing algorithms (or why it would be advantageous); dsp.stackexchange.com may give you a better answer.


The subplots were generated with the following code:

subplot(121);
scatter(data(:,2),data(:,3), 50, clustIDX, 'filled')
axis tight 
box on
xlabel('feature 1'), ylabel('feature 2')
title('labelled points')

subplot(122);
scatter(clusters(:,2),clusters(:,3), 200, (1:K)', 'filled')
axis tight
box on
xlabel('feature 1'),ylabel('feature 2')
title('clusters')

Second plot:

figure
scatter([1:length(clustIDX)],clustIDX, 50, clustIDX, 'filled')
xlabel('time'),ylabel('cluster')
box on
axis tight
title('labelled points in time domain')
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow