In cases where the number of features is either huge or exceeds the number of observations, it is advisable to calculate the principal components based on the transposed dataset. This is especially true in your case because the default approach implies computing an 8603 x 8603 covariance matrix, which by itself already consumes about 500 MB of memory (oh well, this isn't too much, but hey...).
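For reference, that estimate assumes a dense double-precision matrix (8 bytes per entry), which a one-line check makes explicit:

8603^2 * 8 / 2^20   # roughly 565 MiB for a dense matrix of doubles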
Assuming that the rows of your matrix $X$ correspond to observations and its columns correspond to features, center your data and then perform PCA on the transpose of the centered $X$. There will not be more eigenpairs than the number of observations anyway. Finally, left-multiply each resulting eigenvector by $X^T$. You do not need to do the latter for the eigenvalues (see the Details section below for a detailed explanation).
What you want
The following code implements PCA on the transposed dataset and compares the results of prcomp and the "transposed PCA":
pca.reduced <- function(X, center=TRUE, retX=TRUE) {
  # Note that the data must first be centered on the *original* dimensions,
  # because centering the 'transposed covariance' would be meaningless for
  # the dataset. This is also why the standard deviations are obtained by
  # scaling the singular values with the number of observations
  # (nrow(X) - 1) instead of simply using cov().
  if (center) {
    mu <- colMeans(X)
    X  <- sweep(X, 2, mu, `-`)
  }
  # From now on we work with the transpose of X:
  Xt  <- t(X)
  aux <- svd(Xt)
  # The right singular vectors of Xt are eigenvectors of X X^T; multiplying
  # them by Xt yields eigenvectors of the covariance X^T X, cf. equation (5)
  # in the Details below.
  V <- Xt %*% aux$v
  # Normalize the columns of V to unit length.
  V <- apply(V, 2, function(x) x / sqrt(sum(x^2)))
  # Done.
  list(X    = if (retX) X %*% V else NULL,
       V    = V,
       sd   = aux$d / sqrt(nrow(X) - 1),
       mean = if (center) mu else NULL)
}
# Example data (low-dimensional, but sufficient for this example):
X <- cbind(rnorm(1000), rnorm(1000) * 5, rnorm(1000) * 3)
original <- prcomp(X, scale=FALSE)
transposed <- pca.reduced(X)
# See what happens:
> print(original$sdev)
[1] 4.6468136 2.9240382 0.9681769
> print(transposed$sd)
[1] 4.6468136 2.9240382 0.9681769
>
> print(original$rotation)
               PC1           PC2          PC3
[1,] -0.0055505001  0.0067322416  0.999961934
[2,] -0.9999845292 -0.0004024287 -0.005547916
[3,]  0.0003650635 -0.9999772572  0.006734371
> print(transposed$V)
              [,1]          [,2]         [,3]
[1,]  0.0055505001  0.0067322416 -0.999961934
[2,]  0.9999845292 -0.0004024287  0.005547916
[3,] -0.0003650635 -0.9999772572 -0.006734371
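As a quick sanity check (not part of the output above), one can also verify that the loadings and the scores agree up to the arbitrary sign of each eigenvector:

# Loadings and scores coincide up to the (arbitrary) sign of each component:
all.equal(abs(unname(original$rotation)), abs(transposed$V))
max(abs(abs(original$x) - abs(transposed$X)))   # should be close to zero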
Details
To see why it is possible to work on the transposed matrix, consider the following:

The general form of the eigenvalue equation is

$$A x = \lambda x \tag{1}$$

Without loss of generality, let $M$ be a centered "copy" of your original dataset $X$. Substitution of $M^T M$ for $A$ yields

$$M^T M x = \lambda x \tag{2}$$

Multiplication of this equation by $M$ yields

$$M M^T M x = \lambda M x \tag{3}$$

Subsequent substitution of $y = M x$ yields

$$M M^T y = \lambda y \tag{4}$$
One can already see that $y$ corresponds to an eigenvector of the "covariance" matrix of the transposed dataset. (Note that $M M^T$ is in fact not a true covariance matrix, as the dataset $X$ was centered along its columns and not its rows. Also, scaling must be done by means of the number of samples (rows of $M$) and not the number of features (columns of $M$, i.e. rows of $M^T$).) It can also be seen that the eigenvalues are the same for $M M^T$ and $M^T M$.
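To illustrate that last point numerically, here is a small self-contained sketch (a toy 5 x 3 matrix, unrelated to the data above; all names are only illustrative):

# Toy data: 5 observations, 3 features, centered column-wise.
M <- scale(matrix(rnorm(15), nrow = 5, ncol = 3), center = TRUE, scale = FALSE)
ev.features     <- eigen(t(M) %*% M)$values   # eigenvalues of M^T M (3 x 3)
ev.observations <- eigen(M %*% t(M))$values   # eigenvalues of M M^T (5 x 5)
# The non-zero eigenvalues coincide; M M^T merely has additional zeros:
all.equal(ev.observations[1:3], ev.features)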
Finally, one last multiplication by $M^T$ results in

$$(M^T M) \, M^T y = \lambda \, M^T y \tag{5}$$

where $M^T M$ is the original covariance matrix.

From equation (5) it follows that $M^T y$ is an eigenvector of $M^T M$ with eigenvalue $\lambda$.
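This last step is precisely what the line V <- Xt %*% aux$v in pca.reduced above performs (the right singular vectors of Xt are eigenvectors of $M M^T$). A minimal sketch of that correspondence on a toy matrix (the names M, y, W are only illustrative):

# Toy data: 5 observations, 3 features, centered column-wise.
M <- scale(matrix(rnorm(15), nrow = 5, ncol = 3), center = TRUE, scale = FALSE)
# y: eigenvectors of M M^T belonging to the non-zero eigenvalues.
y <- eigen(M %*% t(M))$vectors[, 1:3]
# Multiplying by M^T gives (unnormalized) eigenvectors of the covariance M^T M:
W <- t(M) %*% y
W <- apply(W, 2, function(w) w / sqrt(sum(w^2)))   # normalize the columns
# Compare with the eigenvectors of M^T M computed directly (signs are arbitrary):
all.equal(abs(W), abs(eigen(t(M) %*% M)$vectors))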