Question

I've been studying Support Vector Machines (SVMs) for a while, and recently started reading articles on clustering. With SVMs we did not need to worry much about the dimensionality of the data; in clustering, however, I learned that dimensionality is a big issue because of the "Curse of Dimensionality". Furthermore, sparsity and data size also greatly affect which clustering algorithm you choose. So I kind of understand that there is no "best algorithm" for clustering, and that it all depends on the nature of the data.

Having said that, I want to ask some really basic questions on clustering.

  1. When people say "high dimensional", what do they mean specifically? Is 100d high-dimensional? Or does this depend on the type of data you have?

  2. I've seen answers on this site saying things like "using k-means on data with hundreds of dimensions is very usual". If this is true, does it also hold for other clustering algorithms that use the same distance metric as k-means?

  3. On p. 649 of the paper "Survey of Clustering Algorithms" (http://goo.gl/WQyuxo) by Rui Xu et al., the table shows that CURE has "the capability of tackling high dimensional data", and I was wondering if anybody has an idea of how high a dimensionality they are talking about.

  4. If I wanted to perform clustering on a high-dimensional dataset of adequate size, randomly sampled from the initial big dataset, what kinds of algorithms would be appropriate? I understand that density-based algorithms such as DBSCAN do not perform well under random sampling.

  5. Can anybody tell me how well or badly CURE performs on high-dimensional data? Intuitively I would guess that CURE does not perform well, considering the Curse of Dimensionality, but it would be great if there were some detailed results.

  6. Are there any websites/papers/textbooks explaining the pros and cons of clustering algorithms? I've seen some papers on the pros and cons of the basic algorithms, i.e., k-means, hierarchical clustering, DBSCAN, etc., but I wanted to know more about other algorithms such as CURE, CLIQUE, and CHAMELEON.

Sorry for asking so many questions at once! It would be awesome if anybody could answer even one of them. Also, if I have stated a question poorly or asked a completely pointless one, don't hesitate to tell me. And if anybody knows a great textbook or survey paper on clustering that elaborates on these subjects, please let me know! Thank you in advance.

Solution

You may be interested in this survey:

Kriegel, H. P., Kröger, P., & Zimek, A. (2009).
Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering.
ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1), 1.

One of the authors wrote DBSCAN, so it will likely help shed some light on your DBSCAN questions.

100-dimensional data can be high-dimensional data, if it isn't sparse. For NLP people, 100d is laughably little, but their data is special: it is derived from an essentially binary nature (a word is present or not), so each dimension actually carries less than 1 bit of information. If you have dense 100-dimensional data, you are usually in trouble.
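
As a back-of-the-envelope illustration of the "less than 1 bit per dimension" point, here is a minimal sketch (the 2% document frequency is just an assumed example value, and numpy is assumed to be available):

    import numpy as np

    def bernoulli_entropy_bits(p):
        """Entropy in bits of a binary feature that is 1 with probability p."""
        if p in (0.0, 1.0):
            return 0.0
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    # A term that appears in half of all documents carries the maximum 1 bit;
    # a typical sparse term appearing in ~2% of documents carries far less.
    print(bernoulli_entropy_bits(0.5))   # 1.0
    print(bernoulli_entropy_bits(0.02))  # ~0.14 bits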

There are some nice figures in a related follow-up survey by the same authors:

Zimek, A., Schubert, E., & Kriegel, H. P. (2012).
A survey on unsupervised outlier detection in high-dimensional numerical data.
Statistical Analysis and Data Mining, 5(5), 363-387.

They have analyzed the behavior of distance functions nicely for such data. The essence is: high-dimensional data can be hard, or easy; it all depends on the signal-to-noise ratio. If all your dimensions carry signal, additional dimensions can actually make your problems easier. If the additional dimensions are distracting, things can break down.
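
Here is a small, hedged experiment sketch of the "distracting dimensions" side of that argument, using scikit-learn; all the sizes, the noise scale, and the number of clusters below are arbitrary choices, not values from the surveys:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(0)
    # 5 well-separated clusters living in 10 "signal" dimensions.
    X_signal, y = make_blobs(n_samples=500, centers=5, n_features=10,
                             cluster_std=1.0, random_state=0)

    for n_noise in (0, 10, 100, 1000):
        # Pad the data with pure-noise dimensions that carry no cluster signal.
        noise = rng.normal(scale=10.0, size=(X_signal.shape[0], n_noise))
        X = np.hstack([X_signal, noise])
        labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
        print(n_noise, round(adjusted_rand_score(y, labels), 3))
    # The agreement with the true labels (ARI) tends to degrade once the
    # distracting dimensions start to dominate the Euclidean distances.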

This may also explain why the "kernel trick" with SVMs works: it does not really add information content; the increased dimensionality is only virtual, not intrinsic. You get a larger search and solution space, but your data still lies on a lower-dimensional manifold within that space.
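
For example, here is a minimal sketch with scikit-learn (the data set and parameters are just illustrative assumptions): two concentric circles are not linearly separable in their original 2-D space, but an RBF-kernel SVM separates them easily, even though the data is still just a 2-D manifold inside the implicit feature space.

    from sklearn.datasets import make_circles
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Two concentric circles: intrinsically 2-D, not linearly separable.
    X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

    for kernel in ("linear", "rbf"):
        score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
        print(kernel, round(score, 3))
    # "linear" stays near chance level; "rbf" gets close to 1.0, because the
    # kernel only changes the search space, not the information in the data.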

k-means results on high-dimensional data tend to become meaningless. In many cases they still work "well enough", because often quality does not really matter a lot and any convex partitioning will do (e.g., bag-of-words approaches for image similarity don't seem to improve substantially with "better" k-means clusterings).

CURE, which also seems to use a sum-of-squares criterion (like k-means), should suffer from the same problems. In high dimensions, all sum-of-squares values become increasingly similar (i.e., any partitioning looks about as good as any other).
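
A quick way to see this distance-concentration effect is to look at the relative contrast between the nearest and farthest pairwise distances on structureless data. The sketch below (using numpy and scipy, with arbitrary sample sizes) is just one such illustration, not a result from the cited papers:

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        X = rng.uniform(size=(500, d))      # i.i.d. uniform data, no clusters
        dist = pdist(X)                     # all pairwise Euclidean distances
        contrast = (dist.max() - dist.min()) / dist.min()
        print(d, round(contrast, 3))
    # The relative contrast shrinks sharply as d grows, which is one way to see
    # why sum-of-squares criteria lose their discriminative power there.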

Yes, there are plenty of textbooks, surveys, and studies that have tried to compare clustering algorithms. But in the end there are too many factors involved: what your data looks like, how you preprocessed it, whether you have a well-chosen and appropriate distance measure, how good your implementation is, whether you have index acceleration to speed up some algorithms, etc. There is no rule of thumb; you will have to try things out.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow