Question

I've been studying Support Vector Machines (SVMs) for a while, and recently started reading articles on clustering. With SVMs we did not need to worry much about the dimensionality of the data; in clustering, however, I learned that dimensionality is a big issue because of the "Curse of Dimensionality". Furthermore, sparsity and data size also greatly affect which clustering algorithm you choose. So I kind of understand that there is no "best algorithm" for clustering, and that it all depends on the nature of the data.

Having said that, I want to ask some really basic questions on clustering.

  1. When people say "high dimensional", what do they mean specifically? Is 100d high-dimensional? Or does this depend on the type of data you have?

  2. I've seen answers on this site saying things like "using k-means on data with hundreds of dimensions is very usual". If this is true, does it also hold for other clustering algorithms that use the same distance metric as k-means?

  3. On p. 649 of the paper "Survey of Clustering Algorithms" (http://goo.gl/WQyuxo) by Rui Xu et al., the table shows that CURE has "the capability of tackling high dimensional data", and I was wondering if anybody has an idea of how high a dimensionality they are talking about.

  4. If I wanted to perform clustering on a high-dimensional dataset of adequate size, randomly sampled from the initial big dataset, what kinds of algorithms would be appropriate? I understand that density-based algorithms such as DBSCAN do not perform well under random sampling.

  5. Can anybody tell me how well or badly CURE performs on high-dimensional data? Intuitively I would guess that CURE does not perform well, considering the Curse of Dimensionality, but it would be great if there were some detailed results.

  6. Are there any websites/papers/textbooks explaining the pros and cons of clustering algorithms? I've seen some papers on the pros and cons of the basic algorithms, i.e., k-means, hierarchical clustering, DBSCAN, etc., but I wanted to know more about other algorithms such as CURE, CLIQUE, and CHAMELEON.

Sorry for asking so many questions at once! It would be awesome if anybody could answer even one of them. Also, if I have stated a question poorly or asked a completely pointless one, don't hesitate to tell me. And if anybody knows a great textbook or survey paper on clustering that elaborates on these subjects, please let me know! Thank you in advance.

Solution

You may be interested in this survey:

Kriegel, H. P., Kröger, P., & Zimek, A. (2009).
Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering.
ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1), 1.

One of the authors wrote DBSCAN, so it will likely help shed some light on your DBSCAN questions.

100-dimensional data can be high-dimensional data, if it isn't sparse. For NLP people, 100d is laughably little, but their data is special: it is derived from an essentially binary nature (a word is present or not), so each dimension actually carries less than 1 bit of information. If you have dense 100-dimensional data, you are usually in trouble.
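
As a back-of-the-envelope illustration of the "less than 1 bit per dimension" point, here is a minimal sketch (the 2% document frequency is just an assumed example value, and numpy is assumed to be available):

    import numpy as np

    def bernoulli_entropy_bits(p):
        """Entropy in bits of a binary feature that is 1 with probability p."""
        if p in (0.0, 1.0):
            return 0.0
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    # A term that appears in half of all documents carries the maximum 1 bit;
    # a typical sparse term appearing in ~2% of documents carries far less.
    print(bernoulli_entropy_bits(0.5))   # 1.0
    print(bernoulli_entropy_bits(0.02))  # ~0.14 bits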

There are some nice figures in a related follow-up survey by the same authors:

Zimek, A., Schubert, E., & Kriegel, H. P. (2012).
A survey on unsupervised outlier detection in high-dimensional numerical data.
Statistical Analysis and Data Mining, 5(5), 363-387.

They have analyzed the behavior of distance functions nicely for such data. The essence is: high-dimensional data can be hard, or easy; it all depends on the signal-to-noise ratio. If all your dimensions carry signal, additional dimensions can actually make your problems easier. If the additional dimensions are distracting, things can break down.
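
Here is a small, hedged experiment sketch of the "distracting dimensions" side of that argument, using scikit-learn; all the sizes, the noise scale, and the number of clusters below are arbitrary choices, not values from the surveys:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(0)
    # 5 well-separated clusters living in 10 "signal" dimensions.
    X_signal, y = make_blobs(n_samples=500, centers=5, n_features=10,
                             cluster_std=1.0, random_state=0)

    for n_noise in (0, 10, 100, 1000):
        # Pad the data with pure-noise dimensions that carry no cluster signal.
        noise = rng.normal(scale=10.0, size=(X_signal.shape[0], n_noise))
        X = np.hstack([X_signal, noise])
        labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
        print(n_noise, round(adjusted_rand_score(y, labels), 3))
    # The agreement with the true labels (ARI) tends to degrade once the
    # distracting dimensions start to dominate the Euclidean distances.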

This may also explain why the "kernel trick" with SVMs works: it does not really add information content; the increased dimensionality is only virtual, not intrinsic. You get a larger search and solution space, but your data still lies on a lower-dimensional manifold within that space.
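
For example, here is a minimal sketch with scikit-learn (the data set and parameters are just illustrative assumptions): two concentric circles are not linearly separable in their original 2-D space, but an RBF-kernel SVM separates them easily, even though the data is still just a 2-D manifold inside the implicit feature space.

    from sklearn.datasets import make_circles
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Two concentric circles: intrinsically 2-D, not linearly separable.
    X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

    for kernel in ("linear", "rbf"):
        score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
        print(kernel, round(score, 3))
    # "linear" stays near chance level; "rbf" gets close to 1.0, because the
    # kernel only changes the search space, not the information in the data.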

k-means results on high-dimensional data tend to become meaningless. In many cases they still work "well enough", because often quality does not really matter a lot and any convex partitioning will do (e.g., bag-of-words approaches for image similarity don't seem to improve substantially with "better" k-means clusterings).

CURE, which also seems to use a sum-of-squares criterion (like k-means), should suffer from the same problems. In high dimensions, all sum-of-squares values become increasingly similar (i.e., any partitioning looks about as good as any other).
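
A quick way to see this distance-concentration effect is to look at the relative contrast between the nearest and farthest pairwise distances on structureless data. The sketch below (using numpy and scipy, with arbitrary sample sizes) is just one such illustration, not a result from the cited papers:

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        X = rng.uniform(size=(500, d))      # i.i.d. uniform data, no clusters
        dist = pdist(X)                     # all pairwise Euclidean distances
        contrast = (dist.max() - dist.min()) / dist.min()
        print(d, round(contrast, 3))
    # The relative contrast shrinks sharply as d grows, which is one way to see
    # why sum-of-squares criteria lose their discriminative power there.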

Yes, there are plenty of textbooks, surveys, and studies that have tried to compare clustering algorithms. But in the end there are too many factors involved: what your data looks like, how you preprocessed it, whether you have a well-chosen and appropriate distance measure, how good your implementation is, whether you have index acceleration to speed up some algorithms, etc. There is no rule of thumb; you will have to try things out.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow