Question

I find that many algorithms are used for dimensionality reduction. The more commonly used ones (e.g. on this page) are:

Principal component analysis (PCA)
Factor analysis
Independent component analysis (ICA)
Non-negative matrix factorization (NMF or NNMF)
Latent Dirichlet allocation (LDA)

What are the major differences between these? How does one choose one over the other? Thanks for your insight.

Solution

A detailed answer would require many pages of explanation, but I think a brief answer may point in the right direction for further research.

First of all, the choice of dimensionality reduction algorithm depends on the problem and data at hand. There is no gold standard; your problem requirements dictate the best option(s) to try out.

The main concept of dimensionality reduction is to find alternative representations of the data that have fewer "dimensions" while at the same time keeping most of the original information contained in the data.

Right from the start, one sees that there is a trade-off between the space/dimensions used and the information kept: the fewer dimensions that are kept, the less of the original information contained in the data is retained.

The trick is to leave out information that is redundant or useless for the problem at hand. This is why the choice of algorithm depends crucially on the problem and data: one can then indeed reduce the space/dimensions of the data while keeping all the relevant information.

In order to do that, and depending on the nature and characteristics of the data, one can try several approaches:

1. PCA and variants

This factors the data into "principal/decorrelated components", keeps those with the most variance and discards the rest (as irrelevant or noise). It decomposes the data according to statistical variance (i.e. 2nd-order statistics), usually via variants of EVD/SVD on the correlation matrix of the data.

PCA is the best linear dimension reduction technique in the mean-square error sense.
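As a minimal sketch of what this looks like in practice (assuming scikit-learn as the tool, which the answer does not prescribe, and random data purely to illustrate the API):

```python
# Minimal PCA sketch with scikit-learn (assumed tooling); the data is random
# and only serves to illustrate the reduction from 10 to 3 dimensions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 samples, 10 original dimensions

pca = PCA(n_components=3)             # keep the 3 components with the most variance
X_reduced = pca.fit_transform(X)      # shape (200, 3)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # fraction of variance retained per component
```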

2. ICA

This factors the data into "independent components", which means it uses higher-order statistics, whereas 2nd-order statistics imply decorrelation only. However, depending on the algorithm, some data may not work well with ICA, since the sources must not be normally distributed (for normal random variables, decorrelation already implies independence). Note: PCA is a pre-processing step in most ICA algorithms (e.g. JADE, ...).

ICA is a higher-order method that seeks linear projections, not necessarily orthogonal to each other, that are as nearly statistically independent as possible. Statistical independence is a much stronger condition than uncorrelatedness.
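A small sketch of the idea, assuming scikit-learn's FastICA and a made-up mixture of two non-Gaussian signals:

```python
# Minimal ICA sketch with scikit-learn's FastICA (assumed tooling):
# recover independent, non-Gaussian sources from a synthetic linear mixture.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                          # non-Gaussian source 1
s2 = np.sign(np.cos(3 * t))                 # non-Gaussian source 2
S = np.c_[s1, s2]                           # true sources, shape (2000, 2)
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])                  # made-up mixing matrix
X = S @ A.T + 0.02 * rng.normal(size=S.shape)  # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                # estimated independent components
print(S_est.shape)                          # (2000, 2)
```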

3. Dictionary and variants

Note that the above algorithms result in a set of "basic" components which can form a basis for all the data. Like the basis vectors of a vector space, each datum is a specific combination/sample of the basic components; or, like a dictionary, each datum is a combination/sample of elements from this dictionary. But what if one knows the basic dictionary beforehand for a certain problem? Then one can try to find the optimum representation of each datum with respect to this basic dictionary, or one can try to learn this basic dictionary adaptively using some adaptive learning method.
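A minimal sketch of the "learn the dictionary adaptively" route, assuming scikit-learn's DictionaryLearning and random data for illustration:

```python
# Minimal dictionary-learning sketch with scikit-learn (assumed tooling):
# learn a small set of atoms and encode each sample as a sparse combination of them.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                    # 100 samples, 8 dimensions

dl = DictionaryLearning(n_components=4,          # number of dictionary atoms
                        transform_algorithm="omp",
                        transform_n_nonzero_coefs=2,
                        random_state=0)
codes = dl.fit_transform(X)                      # sparse codes, shape (100, 4)
atoms = dl.components_                           # learned dictionary, shape (4, 8)
print(codes.shape, atoms.shape)
```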

4. Factor analysis

Also note that the first two approaches extract a set of basic factors from the data (i.e. they are equivalent to data factorisations). But what if one assumes a more general probabilistic setting for extracting (linear or non-linear) data factors? That is factor analysis. For example, PCA/ICA can be seen as specific cases of factor analysis where the factors are required to be uncorrelated/independent. Factor analysis is the generic probabilistic form.
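A minimal probabilistic factor analysis sketch, again assuming scikit-learn and random data:

```python
# Minimal factor analysis sketch with scikit-learn (assumed tooling):
# model each sample as a linear combination of latent factors plus per-feature noise.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))           # 300 samples, 6 observed variables

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)            # latent factor scores, shape (300, 2)

print(scores.shape)
print(fa.components_.shape)             # factor loadings, shape (2, 6)
print(fa.noise_variance_.shape)         # per-feature noise variances, shape (6,)
```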

5. Other data factorisation methods..

One gets the idea: if the data have certain properties that can be exploited in learning the optimum minimum dimension for representation, one can try variations of data factorisation methods that exploit those properties.
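For instance, when the data are non-negative (counts, pixel intensities), non-negative matrix factorisation (NMF, from the question's list) exploits exactly that property. A minimal sketch, assuming scikit-learn and random non-negative data:

```python
# Minimal NMF sketch with scikit-learn (assumed tooling), exploiting the
# non-negativity of the data: X is approximated by W @ H with W, H >= 0.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random(size=(150, 20))          # non-negative data, 20 original dimensions

nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                # reduced representation, shape (150, 5)
H = nmf.components_                     # non-negative basis, shape (5, 20)
print(W.shape, H.shape)
```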

6. Unsupervised clustering methods..

Being able to find optimum representations of data automatically is a great advantage. Unsupervised clustering algorithms can be used to this end: they cluster the data in an unsupervised manner (no prior information given), and the cluster representatives can then be chosen as the dictionary or basis factors that best represent the data as a whole. This results in dimensionality reduction (e.g. k-means, vector quantisation, ...).
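A minimal sketch of the clustering / vector-quantisation idea, assuming scikit-learn's KMeans and random data: the centroids play the role of the dictionary, and each sample is represented either by its cluster index or by its distances to the centroids.

```python
# Minimal k-means / vector quantisation sketch with scikit-learn (assumed tooling):
# the cluster centroids act as representatives, giving a compact representation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))          # 500 samples, 30 original dimensions

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
X_reduced = kmeans.transform(X)         # distances to the 8 centroids, shape (500, 8)
labels = kmeans.predict(X)              # hard assignment: one centroid index per sample
print(X_reduced.shape, labels.shape)
```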

References for further research:

  1. A survey of dimension reduction techniques

Advances in data collection and storage capabilities during the past decades have led to an information overload in most sciences. Researchers working in domains as diverse as engineering, astronomy, biology, remote sensing, economics, and consumer transactions, face larger and larger observations and simulations on a daily basis. Such datasets, in contrast with smaller, more traditional datasets that have been studied extensively in the past, present new challenges in data analysis. Traditional statistical methods break down partly because of the increase in the number of observations, but mostly because of the increase in the number of variables associated with each observation. The dimension of the data is the number of variables that are measured on each observation. High-dimensional datasets present many mathematical challenges as well as some opportunities, and are bound to give rise to new theoretical developments. One of the problems with high-dimensional datasets is that, in many cases, not all the measured variables are "important" for understanding the underlying phenomena of interest. While certain computationally expensive novel methods can construct predictive models with high accuracy from high-dimensional data, it is still of interest in many applications to reduce the dimension of the original data prior to any modeling of the data.

  2. A survey of dimensionality reduction techniques

Experimental life sciences like biology or chemistry have seen in the recent decades an explosion of the data available from experiments. Laboratory instruments become more and more complex and report hundreds or thousands measurements for a single experiment and therefore the statistical methods face challenging tasks when dealing with such high dimensional data. However, much of the data is highly redundant and can be efficiently brought down to a much smaller number of variables without a significant loss of information. The mathematical procedures making possible this reduction are called dimensionality reduction techniques; they have widely been developed by fields like Statistics or Machine Learning, and are currently a hot research topic. In this review we categorize the plethora of dimension reduction techniques available and give the mathematical insight behind them.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange