Question

I am using princomp in R to perform PCA. My data matrix is huge (10K x 10K, with each value recorded to 4 decimal places). It takes ~3.5 hours and ~6.5 GB of physical memory on a Xeon 2.27 GHz processor.

Since I only want the first two components, is there a faster way to do this?

Update:

In addition to speed, is there a memory-efficient way to do this?

It takes ~2 hours and ~6.3 GB of physical memory to calculate the first two components using svd(,2,).


Solution

You sometimes get access to so-called 'economical' decompositions, which allow you to cap the number of eigenvalues / eigenvectors computed. It looks like eigen() and prcomp() do not offer this, but svd() allows you to specify the maximum number to compute.

On small matrices, the gains seem modest:

R> set.seed(42); N <- 10; M <- matrix(rnorm(N*N), N, N)
R> library(rbenchmark)
R> benchmark(eigen(M), svd(M,2,0), prcomp(M), princomp(M), order="relative")
          test replications elapsed relative user.self sys.self user.child
2 svd(M, 2, 0)          100   0.021  1.00000      0.02        0          0
3    prcomp(M)          100   0.043  2.04762      0.04        0          0
1     eigen(M)          100   0.050  2.38095      0.05        0          0
4  princomp(M)          100   0.065  3.09524      0.06        0          0
R> 

but the factor of three relative to princomp() may make it worth your while to reconstruct princomp() from svd(), as svd() allows you to stop after two singular values.
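
As a minimal sketch of that reconstruction (the names X, loadings and scores are mine, not from the answer): center the columns, run the truncated svd(), and multiply back for the scores.

X  <- matrix(rnorm(100 * 20), 100, 20)        # toy stand-in for the real data
Xc <- scale(X, center = TRUE, scale = FALSE)  # PCA requires centered columns
s  <- svd(Xc, nu = 0, nv = 2)                 # keep only two right singular vectors
loadings <- s$v                               # analogous to prcomp(X)$rotation[, 1:2]
scores   <- Xc %*% loadings                   # coordinates on the first two PCs
sdev     <- s$d[1:2] / sqrt(nrow(Xc) - 1)     # standard deviations of those PCs

Up to sign, loadings should agree with prcomp(X)$rotation[, 1:2].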

OTHER TIPS

The 'svd' package provides routines for truncated SVD / eigendecomposition via the Lanczos algorithm. You can use it to calculate just the first two principal components.

Here is what I get on my machine:

> library(svd)
> set.seed(42); N <- 1000; M <- matrix(rnorm(N*N), N, N)
> system.time(svd(M, 2, 0))
   user  system elapsed 
  7.355   0.069   7.501 
> system.time(princomp(M))
   user  system elapsed 
  5.985   0.055   6.085 
> system.time(prcomp(M))
   user  system elapsed 
  9.267   0.060   9.368 
> system.time(trlan.svd(M, neig = 2))
   user  system elapsed 
  0.606   0.004   0.614 
> system.time(trlan.svd(M, neig = 20))
   user  system elapsed 
  1.894   0.009   1.910
> system.time(propack.svd(M, neig = 20))
   user  system elapsed 
  1.072   0.011   1.087 
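
To turn a truncated SVD into the first two principal components, you still need to center the columns first. A minimal sketch using propack.svd (whose $v field is also used in the benchmark code further down); the names here are illustrative:

library(svd)
X  <- matrix(rnorm(1000 * 100), 1000, 100)
Xc <- scale(X, center = TRUE, scale = FALSE)  # PCA needs centered columns
pr <- propack.svd(Xc, neig = 2)               # only two singular triplets
loadings <- pr$v                              # analogous to prcomp()$rotation
scores   <- Xc %*% loadings                   # scores on the first two PCs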

I tried the pcaMethods package's implementation of the NIPALS algorithm. By default it calculates the first two principal components. It turns out to be slower than the other suggested methods.

set.seed(42); N <- 10; M <- matrix(rnorm(N*N), N, N)
library(pcaMethods)
library(rbenchmark)
m1 <- pca(M, method="nipals", nPcs=2)
benchmark(pca(M, method="nipals"),
          eigen(M), svd(M,2,0), prcomp(M), princomp(M), order="relative")

                       test replications elapsed relative user.self sys.self
3              svd(M, 2, 0)          100    0.02      1.0      0.02        0
2                  eigen(M)          100    0.03      1.5      0.03        0
4                 prcomp(M)          100    0.03      1.5      0.03        0
5               princomp(M)          100    0.05      2.5      0.05        0
1 pca(M, method = "nipals")          100    0.23     11.5      0.24        0

The power method might be what you want. If you code it in R, which is not hard at all, I think you may find that it is no faster than the SVD approach suggested in another answer, which makes use of compiled LAPACK routines.
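
For reference, a rough sketch of what such a power iteration could look like in R (illustrative only; the function name and tolerances are my choices). It multiplies by t(X) %*% X implicitly, so the p x p covariance matrix is never formed, and a deflation step yields the second component:

power_pc <- function(Xc, tol = 1e-9, maxit = 1000) {
  v <- rnorm(ncol(Xc)); v <- v / sqrt(sum(v^2))   # random start, unit norm
  for (i in seq_len(maxit)) {
    w <- crossprod(Xc, Xc %*% v)                  # t(Xc) %*% (Xc %*% v)
    w <- w / sqrt(sum(w^2))
    if (sum(abs(w - v)) < tol) break              # converged
    v <- w
  }
  drop(w)
}

X  <- matrix(rnorm(1000 * 50), 1000, 50)
Xc <- scale(X, center = TRUE, scale = FALSE)
v1 <- power_pc(Xc)                        # first principal direction
Xd <- Xc - (Xc %*% v1) %*% t(v1)          # deflate out the first component
v2 <- power_pc(Xd)                        # second principal direction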

You can use a neural network approach to find the principal components. A basic description is given here: http://www.heikohoffmann.de/htmlthesis/node26.html

The first principal component is y = w1*x1 + w2*x2, and the second, orthogonal component can be calculated as q = w2*x1 - w1*x2.
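
A toy sketch of this idea for 2-D inputs, using Oja's rule as described on the linked page (the learning rate and iteration scheme are arbitrary choices of mine):

set.seed(42)
X <- cbind(rnorm(5000, sd = 3), rnorm(5000, sd = 1))  # elongated 2-D cloud
X <- scale(X, center = TRUE, scale = FALSE)
w <- c(1, 0); eta <- 0.001
for (i in seq_len(nrow(X))) {
  x <- X[i, ]
  y <- sum(w * x)                 # output of the linear neuron
  w <- w + eta * y * (x - y * w)  # Oja's rule keeps ||w|| near 1
}
w                                 # approaches the first principal direction
q <- c(w[2], -w[1])               # the orthogonal second direction, as above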

The "gmodels" and "corpcor" R packages come with faster implementations of SVD and PCA. These perform similarly to the core versions for small matrices:

> set.seed(42); N <- 10; M <- matrix(rnorm(N*N*N), N*N, N)
> library("rbenchmark")
> library("gmodels")    
> benchmark(svd(M,2,0), svd(M), gmodels::fast.svd(M), corpcor::fast.svd(M), prcomp(M), gmodels::fast.prcomp(M), princomp(M), order="relative")
                     test replications elapsed relative user.self sys.self user.child sys.child
1            svd(M, 2, 0)          100   0.005      1.0     0.005    0.000          0         0
2                  svd(M)          100   0.006      1.2     0.005    0.000          0         0
3    gmodels::fast.svd(M)          100   0.007      1.4     0.006    0.000          0         0
4    corpcor::fast.svd(M)          100   0.007      1.4     0.007    0.000          0         0
6 gmodels::fast.prcomp(M)          100   0.014      2.8     0.014    0.000          0         0
5               prcomp(M)          100   0.015      3.0     0.014    0.001          0         0
7             princomp(M)          100   0.030      6.0     0.029    0.001          0         0
> 

However, they provide a faster result for larger matrices (especially those with many rows).

> set.seed(42); N <- 10; M <- matrix(rnorm(N*N*N*N), N*N*N, N)
> library("rbenchmark")
> library("gmodels")
> benchmark(svd(M,2,0), svd(M), gmodels::fast.svd(M), corpcor::fast.svd(M), prcomp(M), gmodels::fast.prcomp(M), order="relative")

                     test replications elapsed relative user.self sys.self user.child sys.child
4    corpcor::fast.svd(M)          100   0.029    1.000     0.028    0.001          0         0
3    gmodels::fast.svd(M)          100   0.035    1.207     0.033    0.001          0         0
2                  svd(M)          100   0.037    1.276     0.035    0.002          0         0
1            svd(M, 2, 0)          100   0.039    1.345     0.037    0.001          0         0
5               prcomp(M)          100   0.068    2.345     0.061    0.006          0         0
6 gmodels::fast.prcomp(M)          100   0.068    2.345     0.060    0.007          0         0
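
If the prcomp() interface is what you want, fast.prcomp is intended as a drop-in replacement, so extracting two components should look like this (a sketch with illustrative names):

library(gmodels)
X   <- matrix(rnorm(10000 * 100), 10000, 100)
fit <- fast.prcomp(X)            # same return structure as prcomp()
fit$rotation[, 1:2]              # loadings of the first two components
head(fit$x[, 1:2])               # scores on the first two components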

I am surprised nobody mentioned the irlba package yet:

It is even a bit faster than svd's propack.svd, provides a convenient stats::prcomp-like interface via irlba::prcomp_irlba(X, n=2), and required no parameter tuning in the following benchmark on rectangular matrices (2:1 aspect ratio) of varying size. For matrices of size 6000x3000 it is 50 times faster than stats::prcomp. For matrices smaller than 100x50, stats::svd is still faster, though.

[Figure: benchmark results — log-log plot of runtime vs. matrix size (2N x N) for each method]

library(microbenchmark)
library(tidyverse)
# install.packages(c("svd", "corpcor", "irlba", "rsvd"))

exprs <- rlang::exprs(
  svd(M, 2, 2)$v,
  prcomp(M)$rotation[,1:2],
  irlba::prcomp_irlba(M, n=2)$rotation,
  irlba::svdr(M, k=2)$v,
  rsvd::rsvd(M, 2)$v,
  svd::propack.svd(M, neig=2, opts=list(maxiter=100))$v,
  corpcor::fast.svd(M)$v[,1:2]
)

set.seed(42)
tibble(N=c(10,30,100,300,1000,3000)) %>%
  group_by(N) %>%
  do({
    M <- scale(matrix(rnorm(.$N*.$N*2), .$N*2, .$N))
    microbenchmark(!!!exprs,
      times=min(100, ceiling(3000/.$N)))%>%
      as_tibble
  }) %>% 
ggplot(aes(x=N, y=time/1E9,color=expr)) +
  geom_jitter(width=0.05) +
  scale_x_log10("matrix size (2N x N)") +
  scale_y_log10("time [s]") +
  stat_summary(fun.y = median, geom="smooth") +
  scale_color_discrete(labels = partial(str_wrap, width=30))

The randomized SVD provided by rsvd is even faster but, unfortunately, quite far off:

set.seed(42)
N <- 1000
M <- scale(matrix(rnorm(N^2*2), N*2, N))
cor(magrittr::set_colnames(sapply(exprs, function(x) eval(x)[,1]), sapply(exprs, deparse)))
Correlations of the first component across methods (labels abbreviated for readability; they correspond to the expressions in exprs above):

(1) svd(M, 2, 2)$v
(2) prcomp(M)$rotation[, 1:2]
(3) irlba::prcomp_irlba(M, n = 2)$rotation
(4) irlba::svdr(M, k = 2)$v
(5) rsvd::rsvd(M, 2)$v
(6) svd::propack.svd(M, neig = 2, opts = list(maxiter = 100))$v
(7) corpcor::fast.svd(M)$v[, 1:2]

           (1)        (2)        (3)        (4)        (5)        (6)        (7)
(1)  1.0000000  1.0000000 -1.0000000  0.9998748  0.2861840  1.0000000  1.0000000
(2)  1.0000000  1.0000000 -1.0000000  0.9998748  0.2861840  1.0000000  1.0000000
(3) -1.0000000 -1.0000000  1.0000000 -0.9998748 -0.2861840 -1.0000000 -1.0000000
(4)  0.9998748  0.9998748 -0.9998748  1.0000000  0.2903970  0.9998748  0.9998748
(5)  0.2861840  0.2861840 -0.2861840  0.2903970  1.0000000  0.2861840  0.2861840
(6)  1.0000000  1.0000000 -1.0000000  0.9998748  0.2861840  1.0000000  1.0000000
(7)  1.0000000  1.0000000 -1.0000000  0.9998748  0.2861840  1.0000000  1.0000000

It might do better when the data actually has structure, though.
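
If I read the rsvd documentation correctly (an assumption worth verifying against the package docs), the q argument controls the number of power iterations, and raising it trades some speed for accuracy:

library(rsvd)
set.seed(42)
N <- 1000
M <- scale(matrix(rnorm(N^2 * 2), N * 2, N))
v_fast <- rsvd(M, k = 2)$v          # fast, but inaccurate on noise-like data
v_iter <- rsvd(M, k = 2, q = 5)$v   # extra power iterations (q is assumed here)
cor(v_iter[, 1], svd(M, nu = 0, nv = 2)$v[, 1])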

You could write the function yourself and stop at 2 components. It is not too difficult. I have it lying around somewhere; if I find it, I will post it.
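
In that spirit, a rough sketch of what such a hand-rolled, NIPALS-style PCA stopping at two components might look like (this is not the answerer's code, just an illustration):

pca2 <- function(X, ncomp = 2, tol = 1e-9, maxit = 500) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  loadings <- matrix(0, ncol(X), ncomp)
  scores   <- matrix(0, nrow(X), ncomp)
  for (k in seq_len(ncomp)) {
    sc <- Xc[, which.max(colSums(Xc^2))]     # start from the most variable column
    for (i in seq_len(maxit)) {
      ld <- crossprod(Xc, sc) / sum(sc^2)    # regress columns on the score vector
      ld <- ld / sqrt(sum(ld^2))             # normalize the loading vector
      sc_new <- Xc %*% ld
      done <- sum((sc_new - sc)^2) < tol
      sc <- sc_new
      if (done) break
    }
    scores[, k]   <- sc
    loadings[, k] <- ld
    Xc <- Xc - sc %*% t(ld)                  # deflate before the next component
  }
  list(scores = scores, loadings = loadings)
}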

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow