Question

TL;DR: Given a big dataset of unlabeled images (around 36 GiB of raw pixels), how can I cluster the images (based on the pixel values) without knowing the number of clusters K up front?

I am currently working on an unsupervised learning project to cluster images; think of it as clustering MNIST, but with 16x16x3 RGB pixel values and about 48 million examples to cluster. While I don't know the images' identities, I do know that some of them are definitely related because they come from the same source; however, I also don't yet know an appropriate K that would let me "just" run K-means on the set.

I was thinking of computing a 2D embedding with t-SNE and then clustering manually in the embedded space (a simpler task than doing it by hand in 16x16x3-d), but all the t-SNE implementations I could find require loading the data into memory. I also thought about first running t-SNE and then K-means on the embedded data, but if you look at t-SNE results on MNIST, it's obvious that the clusters can be (and probably will be) distorted and skewed in nonlinear ways, so even if I knew a K, the clusters would probably end up split. Using Mahalanobis distances for K-means might be interesting, but since I don't know the covariances to begin with, that appears to be a dead end as well.

Currently I'm testing whether I can run PCA compression on the examples to at least gain some memory back for t-SNE, but that may or may not work; I can't say yet.
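Concretely, I'm experimenting with something along the lines of scikit-learn's IncrementalPCA, which fits out of core one batch at a time; the file name, batch size, and component count below are assumptions, not settled choices:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

N, D = 48_000_000, 16 * 16 * 3            # examples x flattened pixel dims
BATCH = 10_000                            # assumption: tune to available RAM

# assumption: pixels stored row-wise as a memory-mapped uint8 array on disk
X = np.memmap("pixels.dat", dtype=np.uint8, mode="r", shape=(N, D))

ipca = IncrementalPCA(n_components=50)    # 768 dims -> 50 dims
for s in range(0, N, BATCH):
    ipca.partial_fit(X[s:s + BATCH].astype(np.float32) / 255.0)

# second pass: project batch by batch; at float32 the result is roughly
# 9 GiB, which at least makes downstream work on samples feasible
X_reduced = np.vstack([
    ipca.transform(X[s:s + BATCH].astype(np.float32) / 255.0)
    for s in range(0, N, BATCH)
])
```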

Can somebody give me a pointer in the right direction (ideally, but not necessarily, in a Python, TensorFlow, or Apache Beam/Dataflow context)? I was working on porting a streaming/ball K-means a while ago, which has the nice property of creating new clusters "on demand", but before I start implementing that again in Python/TensorFlow/Dataflow, I was hoping somebody could give me some ideas on where to start or what to avoid.
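For reference, the "new clusters on demand" behaviour I mean looks roughly like this toy NumPy sketch (the distance threshold and the implementation are assumptions, not my original port):

```python
import numpy as np

def streaming_kmeans(stream, radius):
    """Toy streaming k-means: a point farther than `radius` from every
    existing centroid opens a new cluster; otherwise it is folded into
    the nearest centroid via a running mean."""
    centroids, counts = [], []
    for x in stream:
        if centroids:
            dists = np.linalg.norm(np.asarray(centroids) - x, axis=1)
            j = int(np.argmin(dists))
            if dists[j] <= radius:
                counts[j] += 1
                centroids[j] += (x - centroids[j]) / counts[j]  # running mean
                continue
        centroids.append(x.astype(np.float64).copy())  # open a new cluster
        counts.append(1)
    return np.asarray(centroids)

# radius controls how eagerly new clusters get created
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 768)).astype(np.float32)
print(streaming_kmeans(data, radius=30.0).shape)
```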


Solution

I don't think any of the clustering techniques "just" work at such a scale. The most scalable are supposedly k-means (just do not use Spark/Mahout; they are really bad) and DBSCAN (some good distributed versions are available).
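As one concrete example of k-means that scales by streaming mini-batches instead of holding everything in memory, scikit-learn's MiniBatchKMeans can be fed batch by batch; the K, batch size, and random stand-in data below are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.rand(100_000, 768).astype(np.float32)  # stand-in for real data

mbk = MiniBatchKMeans(n_clusters=100, batch_size=10_000)  # K is a guess here
for s in range(0, len(X), 10_000):
    mbk.partial_fit(X[s:s + 10_000])   # one incremental update per mini-batch

labels = mbk.predict(X)                # assign every point to a centroid
```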

But you will face many other challenges besides scale, because clustering is difficult. It's not as if it's enough to just run the algorithm and you have clusters. Clustering is an exploratory technique; there is no "correct" clustering. Rather, you will need to run clustering again and again and look at every cluster, because no single parameter setting gets everything right: different clusters may appear only at different parameters.
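In practice, "run clustering again and again" amounts to a parameter sweep with manual inspection in between. A minimal sketch with DBSCAN on a sample (the eps grid and stand-in data are assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

sample = np.random.rand(20_000, 50)  # stand-in for a reduced-dimension sample

for eps in (0.5, 1.0, 2.0, 4.0):     # different clusters surface at different eps
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(sample)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
    # inspect the members of each cluster by hand before settling on anything
```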

But the main challenge in your case will likely be the distance function. Except for idealized settings like MNIST, Euclidean distance will not work at all, and neither will anything that operates on the raw pixels. So you first need to do feature extraction and then define a similarity function.
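One common way to do that, since the question mentions TensorFlow, is to take embeddings from a pretrained CNN and compare them with cosine similarity. A sketch; the choice of MobileNetV2, the upscaling of 16x16 inputs to 96x96, and cosine similarity itself are all assumptions for illustration:

```python
import numpy as np
import tensorflow as tf

# assumption: a pretrained ImageNet backbone as a generic feature extractor
base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, pooling="avg",
    weights="imagenet")

def embed(images_16x16x3):
    """Map raw uint8 pixels to L2-normalized embedding vectors."""
    x = tf.image.resize(tf.cast(images_16x16x3, tf.float32), (96, 96))
    x = tf.keras.applications.mobilenet_v2.preprocess_input(x)
    v = base(x, training=False).numpy()
    return v / np.linalg.norm(v, axis=1, keepdims=True)

a = embed(np.random.randint(0, 256, (8, 16, 16, 3), dtype=np.uint8))
cosine_sim = a @ a.T   # similarity matrix; cluster on this, not raw pixels
```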

When it comes to clustering, work with a sample: cluster the sample, identify interesting clusters, then think of a way to generalize the labels to your entire data set, for example via classification (the clustered sample becomes your training set; predict labels for the remaining points).
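A minimal sketch of that sample-then-generalize loop; the sample size, cluster count, and choice of classifier are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = np.random.rand(1_000_000, 50)        # stand-in for reduced features

# 1. cluster a manageable sample
idx = np.random.choice(len(X), 50_000, replace=False)
km = KMeans(n_clusters=30, n_init=10).fit(X[idx])

# 2. treat the sample's cluster labels as a training set
clf = LogisticRegression(max_iter=1000).fit(X[idx], km.labels_)

# 3. generalize: predict a cluster label for every remaining point
labels = clf.predict(X)
```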

OTHER TIPS

If you're trying to do dimensionality reduction, you should use Mahout; it is best in class, and as far as I know the only open source project to offer truly distributed versions of PCA/SVD:

http://mahout.apache.org/docs/0.13.1-SNAPSHOT/algorithms/linear-algebra/d-spca.html

http://mahout.apache.org/docs/0.13.1-SNAPSHOT/algorithms/linear-algebra/d-ssvd.html

Mahout also has a DBSCAN implementation in the works as part of a Google Summer of Code project; it is worth keeping an eye on.

I'm guessing Anony's "Mahout is bad" remark (I agree Spark is) refers to the deprecated MapReduce version of Mahout, as the new version hasn't implemented K-means out of the box yet, though there is a discussion on the mailing list about how to do this fairly easily.
