Question

I am a biologist. The output of my experiment contains a large number of features, stored as the columns of a table with 563 rows. There are 8603 feature columns, which is quite a lot.

So when I try to run a PCA on it in R, I get "out of memory" errors.

I have also tried running princomp in pieces, but that does not seem to work for our approach.

I tried using the script given in this link...

http://www.r-bloggers.com/introduction-to-feature-selection-for-bioinformaticians-using-r-correlation-matrix-filters-pca-backward-selection/

But it still does not work :(

I am trying to use the following code:

bumpus <- read.table("http://www.ndsu.nodak.edu/ndsu/doetkott/introsas/rawdata/bumpus.html", 
                     skip=20, nrows=49, 
                     col.names=c("id","total","alar","head","humerus","sternum"))

boxplot(bumpus, main="Boxplot of Bumpus' data") ## in this step it is showing the ERROR

# we first standardize the data:
bumpus.scaled <- data.frame( apply(bumpus,2,scale) )
boxplot(bumpus.scaled, main="Boxplot of standardized Bumpus' data")

pca.res <- prcomp(bumpus.scaled, retx=TRUE)
pca.res

# note:
# PC.1 is some kind of average of all the measurements 
#    => measure of size of the bird
# PC.2 has a negative weight for 'sternum' 
#    and positive weights for 'alar', 'head' and 'humerus'
#    => measure of shape of the bird

# first two principal components:
pca.res$x[,1:2]
plot(pca.res$x[,1:2], pch="", main="PC.1 and PC.2 for Bumpus' data (blue=survived, red=died)")
text(pca.res$x[,1:2], labels=c(1:49), col=c(rep("blue",21),rep("red",28)))
abline(v=0, lty=2)
abline(h=0, lty=2)

# compare to segment plot:
windows()
palette(rainbow(12, s = 0.6, v = 0.75)) 
stars(bumpus, labels=c(1:49), nrow=6, key.loc=c(20,-1), 
      main="Segment plot of Bumpus' data", draw.segment=TRUE) 

# compare to biplot:
windows()
biplot(pca.res, scale=0)
# what do the arrows mean?
# consider the arrow for sternum:
abline(0, pca.res$rotation[5,2]/pca.res$rotation[5,1])
# consider the arrow for head:
abline(0, pca.res$rotation[3,2]/pca.res$rotation[3,1])

But the second command

boxplot(bumpus, main="Boxplot of Bumpus' data") ## shows an error

The error is

Error: cannot allocate vector of size 1.4 Mb

In addition: There were 27 warnings (use warnings() to see them)

Please help!


Solution

In cases where the number of features is huge or exceeds the number of observations, it is well advised to calculate the principal components based on the transposed dataset. This is especially true in your case, because the default implies computing an 8603 × 8603 covariance matrix, which by itself already consumes about 565 MB of memory (oh well, this isn't too much, but hey...).
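A quick back-of-the-envelope check (a sketch, assuming standard 8-byte doubles):

p <- 8603
# Size of a p x p double-precision covariance matrix, in MiB:
p^2 * 8 / 2^20   # ~565 MiB, before any working copies R allocates on top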

Assuming that the rows of your matrix X correspond to observations and the columns to features, center your data and then perform PCA on the transpose of the centered X. There cannot be more eigenpairs than the number of observations anyway. Finally, multiply each resulting eigenvector by X^T to map it back to the original feature space. You do not need to do this for the eigenvalues (see the Details section below for a full explanation):

What you want

This code demonstrates the implementation of PCA on the transposed dataset and compares the results of prcomp and the "transposed PCA":

pca.reduced <- function(X, center=TRUE, retX=TRUE) {
  # Note that the data must first be centered on the *original* dimensions
  # because centering the 'transposed covariance' is meaningless for the
  # dataset. This is also why the standard deviations below are scaled by
  # the number of observations instead of simply using cov().
  if (center) {
    mu <- colMeans(X)
    X <- sweep(X, 2, mu, `-`)
  }
  # From now on we're looking at the transpose of X:
  Xt <- t(X)
  aux <- svd(Xt)
  # The columns of aux$v are the eigenvectors y of M %*% t(M) (eq. (4) below);
  # multiplying by t(M) maps them back to feature space (eq. (5)):
  V <- Xt %*% aux$v
  # Normalize the columns of V to unit length.
  V <- apply(V, 2, function(x) x / sqrt(sum(x^2)))
  # Done.
  list(X = if (retX) X %*% V else NULL,   # scores
       V = V,                             # loadings (principal axes)
       sd = aux$d / sqrt(nrow(X)-1),      # standard deviations of the PCs
       mean = if (center) mu else NULL)
}
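
Before the low-dimensional comparison below, here is a usage sketch on data shaped like yours; the random matrix is only a hypothetical stand-in for your real 563 × 8603 feature matrix:

# Hypothetical stand-in for a 563-observation x 8603-feature dataset:
X.wide <- matrix(rnorm(563 * 8603), nrow = 563)
res <- pca.reduced(X.wide)
dim(res$V)                      # 8603 x 563: at most n = 563 eigenpairs
dim(res$X)                      # 563 x 563: the scores
head(res$sd^2 / sum(res$sd^2))  # proportion of variance per component
# (Centering costs one degree of freedom, so the last component is
# numerically degenerate and can be ignored.)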

# Example data (low-dimensional, but sufficient for this example):
X <- cbind(rnorm(1000), rnorm(1000) * 5, rnorm(1000) * 3)

original   <- prcomp(X, scale=FALSE)
transposed <- pca.reduced(X)

# See what happens:    
> print(original$sdev)
[1] 4.6468136 2.9240382 0.9681769
> print(transposed$sd)
[1] 4.6468136 2.9240382 0.9681769
> 
> print(original$rotation)
               PC1           PC2          PC3
[1,] -0.0055505001  0.0067322416  0.999961934
[2,] -0.9999845292 -0.0004024287 -0.005547916
[3,]  0.0003650635 -0.9999772572  0.006734371
> print(transposed$V)
              [,1]          [,2]         [,3]
[1,]  0.0055505001  0.0067322416 -0.999961934
[2,]  0.9999845292 -0.0004024287  0.005547916
[3,] -0.0003650635 -0.9999772572 -0.006734371
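
Eigenvectors are only defined up to sign, which is why individual columns above differ by a factor of -1. A sanity check should therefore compare absolute values; for example:

# Loadings agree up to a sign flip per component:
all.equal(abs(original$rotation), abs(transposed$V), check.attributes = FALSE)
# And so do the scores:
all.equal(abs(original$x), abs(transposed$X), check.attributes = FALSE)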

Details

To see why it is possible to work on the transposed matrix, consider the following:

The general form of the eigenvalue equation is

          A x = λ x                               (1)

Without loss of generality, let M be a centered "copy" of your original dataset X. Substituting M^T M for A yields

          M^T M x = λ x                           (2)

Multiplying this equation from the left by M yields

          M M^T M x = λ M x                       (3)

Substituting y = M x then yields

          M M^T y = λ y                           (4)

One can already see that y corresponds to an eigenvector of the "covariance" matrix of the transposed dataset. (Note that M M^T is not a true covariance matrix, since the dataset X was centered along its columns and not its rows. Also, scaling must be done by the number of samples, i.e. the rows of M, and not the number of features, i.e. the columns of M resp. rows of M^T.)

It can also be seen that the eigenvalues are the same for M M^T and M^T M.

Finally, one last multiplication by M^T results in

          (M^T M) M^T y = λ M^T y                 (5)

where M^T M is (up to scaling by the number of samples) the original covariance matrix.

From equation (5) it follows that M^T y is an eigenvector of M^T M with eigenvalue λ.
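
If you want to convince yourself numerically, the following sketch verifies both claims on a small toy matrix (the names are arbitrary):

set.seed(1)
# Small toy dataset: 5 observations, 3 features, centered column-wise:
M <- scale(matrix(rnorm(15), nrow = 5), center = TRUE, scale = FALSE)
e.small <- eigen(M %*% t(M))  # eq. (4): the n x n problem
e.big   <- eigen(t(M) %*% M)  # eq. (2): the p x p problem
# The non-zero eigenvalues coincide:
round(e.small$values, 10)
round(e.big$values, 10)
# t(M) %*% y is an eigenvector of t(M) %*% M (equal up to sign after normalizing):
y <- e.small$vectors[, 1]
v <- t(M) %*% y
v <- v / sqrt(sum(v^2))
cbind(v, e.big$vectors[, 1])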

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow