Question

I have a dataset with a large number of features (far more than 3). For computational reasons, I would like to apply dimensionality reduction. At this point I could use several different techniques:

- standard PCA
- Kernel PCA
- LLE
- ...

My problem is choosing the right approach: the number of features is so high that I cannot know beforehand what the distribution of the points looks like. I could inspect it visually only if the data were 3-dimensional, but in my case I have far more dimensions than that.

I know, for example, that if the set of points were linearly separable I could use standard PCA; if it had something like a concentric-circles shape, then Kernel PCA would be the better option.

Therefore, how can I know beforehand which dimensionality reduction technique to use for high-dimensional data?


Solution

The fact is that with unsupervised algorithms, you never know. That is their main limitation. Unsupervised algorithms (clustering, dimensionality reduction, etc.) are based on assumptions: an assumption is made, translated into a mathematical algorithm, and applied.

Choosing the right technique, as you said, is possible only if you know the distribution and/or topology of your data beforehand. Unfortunately, that is rarely the case: the higher-dimensional the data, the more difficult it is to guess its structure.

If you are using dimensionality reduction as a feature-extraction step for a supervised task, then the right approach is to evaluate the impact of each technique on your supervised learner through statistical model selection (e.g. cross-validation).
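A minimal sketch of that idea, assuming scikit-learn and a synthetic dataset (the candidate reducers, component count, and classifier are illustrative choices, not prescriptions):

```python
# Compare dimensionality-reduction choices by their downstream
# cross-validated supervised performance.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Candidate reducers; in practice you would add LLE, Isomap, etc.
candidates = {
    "pca": PCA(n_components=5),
    "kpca": KernelPCA(n_components=5, kernel="rbf"),
}

scores = {}
for name, reducer in candidates.items():
    # Putting the reducer inside the pipeline means it is re-fit on each
    # training fold, so the CV score is an honest estimate.
    pipe = Pipeline([("reduce", reducer),
                     ("clf", LogisticRegression(max_iter=1000))])
    scores[name] = cross_val_score(pipe, X, y, cv=5).mean()

best = max(scores, key=scores.get)
print(best, scores)
```

The key point is that the reducer is selected by the same criterion you actually care about (predictive performance), not by guessing the data's geometry.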

If you are using it for an unsupervised task like clustering, then you may choose some practical criterion (there is NO theoretical one, i.e. there is no theoretical justification for the clustering task). For example, you can visualize the data in 2 or 3 dimensions and inspect whether the clusters look right, for instance using some known samples from your data: if you know two extreme cases of very different samples, a better clustering puts them in distant clusters.
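The inspection idea can be sketched like this, assuming scikit-learn and synthetic blob data; the two "known extreme samples" here are just hypothetical placeholder indices:

```python
# Project high-dimensional data to 2D, cluster it, and check how two
# known reference samples relate in the embedding.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=200, centers=3, n_features=10, random_state=0)

# 2D embedding for visual inspection (e.g. via a scatter plot).
emb = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)

# Hypothetical: suppose samples 0 and 100 are known to be very different.
# A sensible embedding/clustering should separate them clearly.
i, j = 0, 100
dist = np.linalg.norm(emb[i] - emb[j])
print(dist, labels[i], labels[j])
```

With real data you would plot `emb` colored by `labels` and eyeball whether the known reference samples land where domain knowledge says they should.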

Again I would emphasize that there is no universally true evaluation for unsupervised tasks like clustering.

Hope it helped!

OTHER TIPS

It can be hard to choose, because it's hard to visualize. However, you probably have a specific goal, right? Maximizing some kind of score.

Why don't you try a grid search over your dimensionality-reduction decision?
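One way to sketch that, assuming scikit-learn: treat the reducer itself as a hyperparameter of a pipeline and let `GridSearchCV` pick the best one by cross-validated score (the specific reducers, component count, and classifier are illustrative):

```python
# Let grid search choose between dimensionality-reduction techniques.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

pipe = Pipeline([("reduce", PCA()), ("clf", SVC())])

# Each grid entry swaps in a different reducer for the "reduce" step.
param_grid = [
    {"reduce": [PCA(n_components=5)]},
    {"reduce": [KernelPCA(n_components=5, kernel="rbf")]},
    {"reduce": [LocallyLinearEmbedding(n_components=5)]},
]

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_["reduce"], search.best_score_)
```

You could also add entries for each reducer's own hyperparameters (kernel, number of neighbors, number of components) to the same grid.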

I'm interested in reading other, more theoretical answers to this question, though.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange