Question

The data set I am trying to cluster is made of multiple heterogeneous dimensions.
For example

<A, B, C, D> 

where A and B are latitude and longitude.
C is a number.
D is a binary value.

What is the best way to approach a clustering problem in this case? Should I normalise the data to make it homogeneous, or should I run a separate clustering problem for each homogeneous set of dimensions?


Solution

k-means is not a good choice here: it does not handle the ±180° longitude wrap-around, and distances anywhere but at the equator are distorted. IIRC, in the northern USA and most parts of Europe the distortion is already over 20%.
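To make both problems concrete, here is a minimal sketch that compares the great-circle (haversine) distance with plain Euclidean distance on raw degrees. The function names and the mean Earth radius of 6371 km are my own choices for illustration, and the sphere is itself an approximation of the Earth.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def naive_degrees(lat1, lon1, lat2, lon2):
    """Euclidean distance on raw degrees, i.e. what k-means would see."""
    return math.hypot(lat2 - lat1, lon2 - lon1)

# One degree of longitude covers less ground the further you are from the equator.
at_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
at_45_north = haversine_km(45.0, 0.0, 45.0, 1.0)

# Two points on either side of the antimeridian are physically close,
# but look almost 360 degrees apart in raw coordinates.
wrap_km = haversine_km(0.0, 179.9, 0.0, -179.9)
wrap_naive = naive_degrees(0.0, 179.9, 0.0, -179.9)

print(f"1 deg of longitude at the equator: {at_equator:.1f} km")
print(f"1 deg of longitude at 45 N:        {at_45_north:.1f} km")
print(f"across the antimeridian: {wrap_km:.1f} km vs {wrap_naive:.1f} raw degrees")
```

The raw-degree distance treats one degree of longitude as the same length everywhere, while the real length shrinks roughly with the cosine of the latitude.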

Similarly, it does not make sense to use k-means on binary data: to be precise, it is the mean that does not make sense.

Use an algorithm that can work with arbitrary distances, and construct a combined distance function that is designed for solving your problem, on your particular data set.

Then use e.g. PAM, DBSCAN, hierarchical linkage clustering, or any other algorithm that works with arbitrary distance functions.
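As an illustration of clustering from distances alone, here is a rough sketch of a k-medoids loop that only ever looks at a precomputed pairwise distance matrix. Note that this is the simple alternating (assign, then update) variant, not the full PAM swap procedure, and the names `k_medoids` and `assign`, the seed, and the toy data are all made up for the example.

```python
import random

def assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(dist, k, max_iter=100, seed=0):
    """Alternating k-medoids on a full pairwise distance matrix.

    Only the matrix is used, so any distance function can be plugged in
    when the matrix is built. Returns (medoid_indices, labels).
    """
    rng = random.Random(seed)
    medoids = rng.sample(range(len(dist)), k)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Empty cluster (e.g. duplicate points): keep the old medoid.
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)

# Toy data: the distance here is a plain absolute difference, standing in
# for whatever combined distance suits the real data.
values = [0, 1, 2, 50, 51, 52]
dist = [[abs(a - b) for b in values] for a in values]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

A medoid is always an actual data point, so unlike a mean it stays meaningful for binary values and for coordinates near the ±180° line.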

OTHER TIPS

The mean of a binary feature can be seen as the frequency of that feature. There are cases in which one can standardise a binary feature v by subtracting its mean, v - \bar{v}.

However, in your case it seems to me that you have three features in three different feature spaces. I'd approach this problem by creating three distances d_v, one appropriate for each feature v \in V. The distance between two entities, say x and y, would then be given by d(x,y) = \sum_{v \in V} w_v d_v(x_v, y_v). You could play with the weights w_v, but I'd probably constrain them so that \sum_{v \in V} w_v = 1 and w_v \geq 0 for all v \in V.
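A minimal sketch of such a weighted distance for records of the form <lat, long, number, binary> could look like the following. The particular per-feature distances (haversine, absolute difference, simple mismatch), the default weights, and the scale parameters `geo_scale_km` and `num_scale` are all assumptions for illustration and would need to be chosen for the actual data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlmb = math.radians(q[1] - p[1])
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def combined_distance(x, y, weights=(0.5, 0.3, 0.2),
                      geo_scale_km=100.0, num_scale=10.0):
    """d(x, y) = sum_v w_v * d_v(x_v, y_v) for records (lat, lon, number, flag).

    The weights must be non-negative and sum to 1. The two scale
    parameters are hypothetical: they bring each per-feature distance
    to a comparable range before weighting.
    """
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    w_geo, w_num, w_bin = weights
    d_geo = haversine_km(x[:2], y[:2]) / geo_scale_km  # location
    d_num = abs(x[2] - y[2]) / num_scale               # numeric feature
    d_bin = 0.0 if x[3] == y[3] else 1.0               # binary mismatch
    return w_geo * d_geo + w_num * d_num + w_bin * d_bin

# Two records at the same place with the same number, differing only in the flag.
x = (48.8566, 2.3522, 12.0, 1)
y = (48.8566, 2.3522, 12.0, 0)
z = (52.5200, 13.4050, 30.0, 1)
print(combined_distance(x, y))
print(combined_distance(x, z))
```

When only the binary flag differs, the distance is exactly the weight of the binary feature, which makes the effect of each w_v easy to reason about.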

The above are just some quick thoughts on it, good luck! PS: Sorry for the text, I'm new here and I don't know how to put latex text here

Licensed under: CC-BY-SA with attribution