Question

As part of a big data analysis project I'm working on, I need to perform PCA on some data using a cloud computing system.

In my case, I'm using Amazon EMR for the job, and Spark in particular.

Leaving the "how to perform PCA in Spark" question aside, I want to get an understanding of how things work behind the scenes when it comes to calculating PCs on a cloud-based architecture.

For example, one of the ways to determine the PCs of a dataset is to calculate the covariance matrix of the features.

When using an HDFS-based architecture, for example, the original data is distributed across multiple nodes; I'm guessing each node receives X records.

How, then, is the covariance matrix calculated in such a case, when each node has only partial data?

This is just an example. I'm trying to find a paper or some documentation explaining all this behind-the-scenes voodoo, and couldn't find anything good enough for my needs (probably due to my poor Google skills).

So I can basically summarize my question(s)/needs as the following:

1. How distributed PCA works on a cloud architecture

Preferably an academic paper or some other sort of explanation that also contains some visuals

2. Spark implementation of D-PCA

How does Spark do it? Does it have any 'twist' in its architecture to do it more efficiently, or how does the use of RDD objects contribute to improving the efficiency? etc.

A presentation or even an online lesson about it would be great.

Thanks in advance to anyone who can provide some reading material.


Solution

The question is more related to the Apache Spark architecture and MapReduce; there is more than one question here, but the central piece of your question is perhaps this:

For example, one of the ways to determine the PCs of a dataset is to calculate the covariance matrix of the features.

When using an HDFS-based architecture, for example, the original data is distributed across multiple nodes; I'm guessing each node receives X records.

How, then, is the covariance matrix calculated in such a case, when each node has only partial data?

I shall address that, which will hopefully clear the matter up to a degree.

Let us look at a common form of the covariance calculation: $\frac{1}{n}\sum(x-\bar{x})(y-\bar{y})$

This requires you to calculate the following:

  • $\bar{x}$
  • $\bar{y}$
  • $x-\bar{x}$ and $y-\bar{y}$
  • Multiply $(x-\bar{x})$ and $(y-\bar{y})$

in a distributed manner. The rest is simple. Let us say I have 100 data points $(x, y)$, which are distributed across 10 Apache Spark workers, each getting 10 data points.

Calculating $\bar{x}$ and $\bar{y}$: Each worker adds up the $x$ (or $y$) values of its 10 data points and divides by 10 to arrive at a partial mean of $x$ (or $y$) (this is the map function). Then the Spark master runs the aggregation step (in the Spark DAG of the job), where the partial means from all 10 workers are added together and divided by 10 again to arrive at the final $\bar{x}$ or $\bar{y}$ (the aggregate/reduce operation). Averaging the partial means like this works because every worker holds the same number of points; in general you would aggregate partial sums and counts instead.
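In Spark's Scala API, this map/reduce split looks roughly like the sketch below. It is a minimal illustration, not a production recipe: it assumes a SparkContext named `sc` is already available, uses made-up data, and aggregates partial (sum, count) pairs so it does not depend on every worker holding exactly 10 points.

```scala
// Minimal sketch of the distributed mean step, assuming an existing SparkContext `sc`.
// 100 made-up (x, y) points spread over 10 partitions (one per worker in the example above).
val points = sc.parallelize((1 to 100).map(i => (i.toDouble, 2.0 * i)), numSlices = 10)

// Map: each point becomes (x, y, 1).
// Reduce: partial sums and counts from the partitions are added component-wise.
val (sumX, sumY, n) = points
  .map { case (x, y) => (x, y, 1L) }
  .reduce { case ((sx1, sy1, n1), (sx2, sy2, n2)) => (sx1 + sx2, sy1 + sy2, n1 + n2) }

val xBar = sumX / n
val yBar = sumY / n
```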

Calculating $(x-\bar{x}) \cdot (y-\bar{y})$: In the same way, distribute the data points, broadcast the $\bar{x}$ and $\bar{y}$ values to all the workers, and then calculate the partial $(x-\bar{x}) \cdot (y-\bar{y})$ products; again run the aggregation to get $\sum (x-\bar{x})(y-\bar{y})$.
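Continuing the sketch above (it reuses `sc`, `points`, `xBar`, `yBar`, and `n` from there), the broadcast-and-aggregate step could look like this:

```scala
// Broadcast the means so every worker has a local copy.
val xBarB = sc.broadcast(xBar)
val yBarB = sc.broadcast(yBar)

val covSum = points
  .map { case (x, y) => (x - xBarB.value) * (y - yBarB.value) } // map: per-point product
  .reduce(_ + _)                                                // reduce: add the partial sums

val covariance = covSum / n // (1/n) * sum((x - xBar)(y - yBar))
```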

Using the above method for the distributed calculation, you get the covariance; for multi-dimensional data, you can get the covariance matrix in the same way.
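For multi-dimensional data you would not normally hand-roll this; MLlib's RowMatrix exposes the same idea directly. A minimal sketch, again assuming an existing SparkContext `sc` and made-up sample vectors:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Each row is one observation with d = 3 features (made-up data).
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(2.0, 4.0, 5.0),
  Vectors.dense(4.0, 3.0, 1.0)
))

val mat = new RowMatrix(observations)
val cov = mat.computeCovariance() // d x d covariance matrix, aggregated across the executors
```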

The point is to distribute the calculation for the stages that can be distributed and then centralize the stages that cannot be distributed. That is, in effect, one of the important aspects of the Spark architecture.

Hope this helps.

Other tips

If you want to see how Spark does it, look at the org.apache.spark.mllib.linalg.distributed.RowMatrix class, starting with the computePrincipalComponentsAndExplainedVariance method.

The part of it that is actually distributed is the computeGramianMatrix method, which accumulates each input vector into a Gramian matrix using BLAS.spr(1.0, v, U.data), where v is an input vector and U represents the upper-triangular part of the matrix. This can be run concurrently on many executors, and the partially aggregated matrices can then be combined by adding them together.
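The real method works on a packed upper-triangular array via the (internal) BLAS.spr helper, but the aggregation pattern itself can be sketched with a plain dense array and treeAggregate. This is not Spark's actual code, just an illustration of the per-executor fold plus pairwise combine it relies on (assuming an existing SparkContext `sc`):

```scala
import org.apache.spark.mllib.linalg.Vectors

val d = 3 // number of features (made-up data)
val vecs = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 1.0),
  Vectors.dense(2.0, 1.0, 0.0)
))

// Dense sketch of the Gramian aggregation: sum of outer products v * v^T.
val gram = vecs.treeAggregate(new Array[Double](d * d))(
  // seqOp: each executor folds its vectors' outer products into a local d*d array
  (acc, v) => {
    for (i <- 0 until d; j <- 0 until d) acc(i * d + j) += v(i) * v(j)
    acc
  },
  // combOp: partially aggregated arrays from different executors are added together
  (a, b) => {
    for (k <- a.indices) a(k) += b(k)
    a
  }
)
```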

Once all the vectors have been aggregated into the Gramian matrix, it converts the matrix to a covariance matrix and then uses SVD to produce the PCA matrix/vector. However, this final stage is not distributed.
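For completeness, here is a short sketch of calling the public entry point that sits on top of the methods described above (made-up 2-D sample data; `sc` again assumed to be an existing SparkContext):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val data = sc.parallelize(Seq(
  Vectors.dense(2.5, 2.4),
  Vectors.dense(0.5, 0.7),
  Vectors.dense(2.2, 2.9),
  Vectors.dense(1.9, 2.2)
))

val rowMat = new RowMatrix(data)

// The Gramian/covariance aggregation runs on the executors; the SVD of the
// resulting small covariance matrix runs locally, as noted above.
val (pc, explainedVariance) = rowMat.computePrincipalComponentsAndExplainedVariance(2)

// Project the original rows onto the principal components.
val projected = rowMat.multiply(pc)
```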
