Question

I have a large set of data (about 8 GB). I would like to use machine learning to analyze it, so I think I should use SVD and then PCA to reduce the dimensionality of the data for efficiency. However, MATLAB and Octave cannot load such a large dataset.

What tools can I use to do SVD with such a large amount of data?


Solution

First of all, dimensionality reduction is used when you have many correlated dimensions and want to reduce the problem size by rotating the data points into a new orthogonal basis and keeping only the axes with the largest variance. With 8 variables (columns) your space is already low-dimensional; reducing the number of variables further is unlikely to solve your technical issues with memory, but may hurt dataset quality a lot. In your concrete case it is more promising to take a look at online learning methods. Roughly speaking, instead of working with the whole dataset, these methods take a small part of it (often referred to as a "mini-batch") at a time and build the model incrementally. (I personally like to interpret the word "online" as a reference to some infinitely long source of data from the Internet, like a Twitter feed, where you simply cannot load the whole dataset at once.)
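
As a minimal sketch of the idea, here is how an online learner could be trained with scikit-learn's SGDClassifier, which supports mini-batch updates via partial_fit. The file name, chunk size and "label" column are assumptions about your data, not something from the question:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier()
    classes = np.array([0, 1])   # all class labels must be given up front

    # Stream the file in chunks instead of loading it whole
    for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
        X = chunk.drop(columns=["label"]).to_numpy()
        y = chunk["label"].to_numpy()
        clf.partial_fit(X, y, classes=classes)   # update the model on one mini-batch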

But what if you really want to apply a dimensionality reduction technique like PCA to a dataset that doesn't fit into memory? Normally a dataset is represented as a data matrix X of size n x m, where n is the number of observations (rows) and m is the number of variables (columns). Typically, memory problems come from only one of these two numbers.

Too many observations (n >> m)

When you have too many observations but the number of variables is small to moderate, you can build the covariance matrix incrementally. Indeed, typical PCA consists of constructing a covariance matrix of size m x m and applying singular value decomposition to it. With m=1000 variables of type float64, the covariance matrix takes 1000*1000*8 bytes ~ 8 MB, which easily fits into memory and may be used with SVD. So you only need to build the covariance matrix without loading the entire dataset into memory - a pretty tractable task.
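
A rough sketch of that, assuming the data sits in a CSV that pandas can read in chunks (file name, chunk size and number of kept components are made up for illustration):

    import numpy as np
    import pandas as pd

    n = 0
    mean_sum = None
    outer_sum = None

    # Accumulate the sums needed for the covariance matrix chunk by chunk
    for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
        X = chunk.to_numpy(dtype=np.float64)
        n += X.shape[0]
        mean_sum = X.sum(axis=0) if mean_sum is None else mean_sum + X.sum(axis=0)
        outer_sum = X.T @ X if outer_sum is None else outer_sum + X.T @ X

    mean = mean_sum / n
    cov = outer_sum / n - np.outer(mean, mean)   # E[xx^T] - mean*mean^T (one-pass, numerically less stable)

    # PCA: decompose the small m x m matrix
    U, S, Vt = np.linalg.svd(cov)
    components = U[:, :50]                       # top-50 principal directions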

Alternatively, you can select a small representative sample from your dataset and approximate the covariance matrix from it. This matrix will have roughly the same properties as the full one, just a little less accurate.

Too many variables (n << m)

On the other hand, sometimes when you have too many variables, the covariance matrix itself will not fit into memory. E.g. if you work with 640x480 images, every observation has 640*480=307200 variables, which results in a ~703 GB covariance matrix! That's definitely not something you want to keep in the memory of your computer, or even in the memory of your cluster. So we need to reduce the dimensions without building a covariance matrix at all.

My favourite method for doing this is Random Projection. In short, if you have a dataset X of size n x m, you can multiply it by some sparse random matrix R of size m x k (with k << m) and obtain a new matrix X' of a much smaller size n x k with approximately the same properties as the original one. Why does it work? Well, recall that PCA aims to find a set of orthogonal axes (principal components) and project your data onto the first k of them. It turns out that sparse random vectors are nearly orthogonal and thus may also be used as a new basis.

And, of course, you don't have to multiply the whole dataset X by R at once - you can translate every observation x into the new basis separately or in mini-batches.
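
A hedged sketch of both points using scikit-learn's SparseRandomProjection, which draws the sparse random matrix R and applies it chunk by chunk (file name, chunk size and k are assumptions):

    import pandas as pd
    from sklearn.random_projection import SparseRandomProjection

    k = 500   # target dimensionality, picked for illustration
    proj = SparseRandomProjection(n_components=k, random_state=0)

    reduced_chunks = []
    for i, chunk in enumerate(pd.read_csv("big_dataset.csv", chunksize=100_000)):
        X = chunk.to_numpy()
        if i == 0:
            proj.fit(X)                            # only draws the sparse random matrix R (m x k)
        reduced_chunks.append(proj.transform(X))   # X' = X R, applied per mini-batch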

There's also a somewhat similar algorithm called Randomized SVD. I don't have any real experience with it, but you can find example code with explanations here.
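
If you want to try it, scikit-learn ships an implementation as sklearn.utils.extmath.randomized_svd; the sketch below uses random data as a stand-in for the real matrix:

    import numpy as np
    from sklearn.utils.extmath import randomized_svd

    X = np.random.rand(10_000, 5_000)              # stand-in for the real data matrix
    U, S, Vt = randomized_svd(X, n_components=50, random_state=0)
    X_reduced = U * S                              # data projected onto the top 50 components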


As a bottom line, here's a short checklist for dimensionality reduction of big datasets:

  1. If you don't have that many dimensions (variables), simply use online learning algorithms.
  2. If there are many observations but a moderate number of variables (the covariance matrix fits into memory), construct the matrix incrementally and use normal SVD.
  3. If the number of variables is too high, use random projection or randomized SVD.

OTHER TIPS

Don't bother.

The first rule of programming, which also applies to data science: get everything working on a small test problem.

So take a random sample of your data, say 100,000 rows. Try different algorithms, etc. Once you have got everything working to your satisfaction, you can try larger (and larger) datasets and see how the test error decreases as you add more data.
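
One simple way to take such a sample without loading the whole file, assuming a CSV and a rough idea of the total row count (both of which are assumptions here):

    import random
    import pandas as pd

    total_rows = 10_000_000            # rough estimate of the file size in rows
    p_keep = 100_000 / total_rows      # keep each row with this probability

    sample = pd.read_csv(
        "big_dataset.csv",
        skiprows=lambda i: i > 0 and random.random() > p_keep,   # i == 0 is the header
    )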

Furthermore, you do not want to apply SVD to only 8 columns: you apply it when you have a lot of columns.

PCA is usually implemented by computing SVD on the covariance matrix.

Computing the covariance matrix is an embarrassingly parallel task, so it scales linearly with the number of records and is trivial to distribute across multiple machines!

Just do one pass over your data to compute the means, then a second pass to compute the covariance matrix. This can be done with map-reduce easily - essentially it's the same as computing the means again. The sum terms in the covariance are trivial to parallelize! You may only need to pay attention to numerics when summing lots of values of similar magnitude.
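
As a sketch of those two passes (file name and chunk size are assumptions; each per-chunk sum is exactly the piece you would hand to a map-reduce worker):

    import numpy as np
    import pandas as pd

    # Pass 1: column means
    n, col_sum = 0, None
    for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
        X = chunk.to_numpy(dtype=np.float64)
        n += X.shape[0]
        col_sum = X.sum(axis=0) if col_sum is None else col_sum + X.sum(axis=0)
    mean = col_sum / n

    # Pass 2: covariance as a sum of per-chunk contributions
    cov = None
    for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
        Xc = chunk.to_numpy(dtype=np.float64) - mean     # center with the global means
        part = Xc.T @ Xc
        cov = part if cov is None else cov + part
    cov /= n - 1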

Things get different when you have a huge number of variables. But on an 8 GB system, you should be able to run PCA on up to 20,000 dimensions in memory with the BLAS libraries. But then you may run into the problem that PCA isn't all that reliable anymore, because it has too many degrees of freedom. In other words: it overfits easily. I've seen the recommendation of having at least 10*d*d records (or was it d^3). So for 10,000 dimensions, you should have at least a billion records (of 10,000 dimensions... that is a lot!) for the result to be statistically reliable.

Although you can probably find some tools that will let you do it on a single machine, you're getting into the range where it makes sense to consider "big data" tools like Spark, especially if you think your dataset might grow. Spark has a component called MLlib which supports PCA and SVD. The documentation has examples.
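
A minimal PySpark sketch using the spark.ml PCA estimator; the file name and k are assumptions, and the data are assumed to be all-numeric columns in a CSV:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import PCA, VectorAssembler

    spark = SparkSession.builder.appName("pca-example").getOrCreate()

    df = spark.read.csv("big_dataset.csv", header=True, inferSchema=True)
    assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
    features = assembler.transform(df)

    pca = PCA(k=3, inputCol="features", outputCol="pca_features")
    model = pca.fit(features)
    reduced = model.transform(features).select("pca_features")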

We implemented SVD on a larger dataset using PySpark. We also compared consistency across different packages. Here is the link.

I would recommend Python: if you lazily evaluate the file you will have a minuscule memory footprint, and NumPy/SciPy give you access to all of the tools Octave/MATLAB would.
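
For example, if the data can be dumped to a flat binary file, NumPy can memory-map it so only the slices you touch are read into RAM (the shape and dtype below are assumptions about how the file was written):

    import numpy as np

    X = np.memmap("big_dataset.bin", dtype=np.float64, mode="r", shape=(1_000_000, 8))

    # Work on slices without materialising the whole array
    col_means = np.zeros(X.shape[1])
    for start in range(0, X.shape[0], 100_000):
        col_means += X[start:start + 100_000].sum(axis=0)
    col_means /= X.shape[0]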

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange