Question

How can I perform a cluster analysis (e.g. k-means, complete linkage, etc.) when objects are represented by vectors of different sizes? For example, Object 1 is represented by a 4-dim vector, Object 2 by a 6-dim vector, Object 3 by a 3-dim vector, etc.

Is there any way to normalize the representation of objects? What do you suggest?

Thank you!


Solution

In short, no, it is not possible to cluster vectors of different dimensionality directly. It may, however, be possible to represent your objects' vectors as members of the same (higher-dimensional) space. This will only work if there is some overlap in the features of your objects. Consider the following vectors:

 object1: a, b, c, d
 object2: b, d, e
 object3: a, d

The set of all features is {a, b, c, d, e}, and the three objects can be represented as follows:

 object1: a, b, c, d, 0
 object2: 0, b, 0, d, e
 object3: a, 0, 0, d, 0

Here 0 is a placeholder indicating that the object does not have that particular feature. Your objects now live in the same 5-dimensional space and can be clustered.
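
The alignment step above can be sketched in plain Python. This is only an illustration under assumptions: the numeric values and the helper name `to_dense` are made up here, since the answer only names the features.

```python
# Hypothetical sparse representation: each object maps feature name -> value.
# The numeric values are invented for illustration only.
objects = {
    "object1": {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0},
    "object2": {"b": 5.0, "d": 6.0, "e": 7.0},
    "object3": {"a": 8.0, "d": 9.0},
}

# The shared feature space is the sorted union of all feature names.
feature_names = sorted(set().union(*(feats.keys() for feats in objects.values())))


def to_dense(features, feature_names):
    """Return a fixed-length list, using 0.0 for features the object lacks."""
    return [features.get(name, 0.0) for name in feature_names]


# Every object now has a vector of the same length, in the same feature order.
dense = {name: to_dense(feats, feature_names) for name, feats in objects.items()}

for name, vector in dense.items():
    print(name, vector)
```

Each resulting list has one position per feature in `feature_names`, so the lists can be stacked into a matrix and passed to whatever clustering routine you use.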

Note: any sane vector library will actually store vectors in a sparse format such as the one in my first example. This gives you a very small memory footprint if only a few features are non-zero. The format of my second example is dense. Some libraries expect dense input, and some can convert from one to the other automatically. In any case, I think it is unlikely that you will have to do the conversion above manually.


Edit: feature vectors need to end up as lists of numbers. Each position in the list corresponds to a particular piece of information. You might start with the following features:

cat - weight 4 kg, is very cute
whale - weight 3000 kg, not very cute, lives in ocean
rat - weight 0.2 kg, not cute at all, lives in sewers

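So the cat here is represented by 2 features, i.e. a vector of dimensionality 2, while the whale and the rat have 3 features each. This information translates to the following table, with 0 filled in where an animal lacks a feature: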

        weight(kg)    cuteness(%)     lives_in_sewers?    lives_in_ocean?
 cat        4             8                  0                  0
 whale    3000            3                  0                  1
 rat       0.2            2                  1                  0

The feature vectors are:

cat = [4, 8, 0, 0]
whale = [3000, 3, 0, 1]
rat = [0.2, 2, 1, 0]
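
As a rough sketch, the same vectors can be built programmatically from records that list only the features each animal has. The column names, the `animals` dictionary, and the helper `to_vector` are illustrative assumptions, not part of any library; the values come from the table above.

```python
import math

# Column order taken from the table; every vector uses this same order.
columns = ["weight_kg", "cuteness", "lives_in_sewers", "lives_in_ocean"]

# Each animal lists only the features known for it (values from the table).
animals = {
    "cat": {"weight_kg": 4, "cuteness": 8},
    "whale": {"weight_kg": 3000, "cuteness": 3, "lives_in_ocean": 1},
    "rat": {"weight_kg": 0.2, "cuteness": 2, "lives_in_sewers": 1},
}


def to_vector(record, columns):
    """Fill in 0 for any column the record does not mention."""
    return [record.get(col, 0) for col in columns]


vectors = {name: to_vector(rec, columns) for name, rec in animals.items()}

# Equal-length vectors can be compared with a standard distance,
# which is what most clustering algorithms rely on.
cat_rat = math.dist(vectors["cat"], vectors["rat"])
cat_whale = math.dist(vectors["cat"], vectors["whale"])

for name, vector in vectors.items():
    print(name, "=", vector)
print("cat-rat distance:", cat_rat)
print("cat-whale distance:", cat_whale)
```

Once every animal has a vector of length 4, Euclidean distance (here via `math.dist`, available from Python 3.8) is well defined between any pair.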
Licensed under: CC-BY-SA with attribution