Question

I have implemented the PCA algorithm and I understand it quite well, but I still have some questions. My code is below; it's a very simple implementation.

import numpy as np

x = np.loadtxt('CCPP', delimiter=',')
row, column = x.shape

# Mean normalization
for i in range(column):
    x[:,i] = (x[:,i] - x[:,i].mean()) / (x[:,i].max() - x[:,i].min())

sigma = x.transpose().dot(x) / row
u, s, v = np.linalg.svd(sigma, 0)
z = x.dot(u[:,:3]) ## new features

new_x = z.dot(u[:,:3].transpose()) ##reconstruction

First Question

As you can see above, my sigma variable is

x.transpose().dot(x) / row

This gives me an $n \times n$ matrix ($n$ is the number of features). But the formula for sigma is $$\Sigma = \frac{1}{n} \sum_{i=1}^n x^{(i)} {x^{(i)}}^T$$

Why is there a summation symbol in the formula? I mean, if I use this formulation, sigma is going to be a number, not a matrix. I need to get an $n \times n$ matrix, right? So is my sigma implementation correct, or am I missing something about the formula?

Second Question

When we reconstruct X (at the bottom of the code), should new_x be equal to my original X? I mean, I reduced the dimensionality of the data set and then reconstructed it, so the original dataset and the reconstructed dataset must be the same, right?

Third Question

This one is easy. Should I use data compression for every dataset that has 1,000, 100,000, or more features? I mean, can I always use it? Is it a good choice to use it every time?


Solution

Regarding the first question:

In the above formula, if I'm not wrong, x is a matrix of elements. So what the formula wants from you is to sum the dot products of every row with its transpose. This will give you a scalar.

x = np.array([1, 2, 3, 4])
res = x.dot(x.transpose())
# res = 30

So my suggestion would be to change that line of code to:

sigma = 0  # initialize the accumulator before summing
for i in range(row):
    sigma += x[i].transpose().dot(x[i])
sigma = sigma / row

Second Question

Because you reduced the dimensionality, the reconstructed new_x matrix will not be the same as the original x.
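
If it helps to see this numerically, here is a small self-contained sketch on synthetic data (the shapes and the choice of 3 components are made up for illustration), following the same steps as the question's code:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 5))                            # synthetic data: 100 samples, 5 features
x = (x - x.mean(axis=0)) / (x.max(axis=0) - x.min(axis=0))   # same normalization as the question

sigma = x.T.dot(x) / x.shape[0]
u, s, v = np.linalg.svd(sigma)

z = x.dot(u[:, :3])            # keep 3 of the 5 components
new_x = z.dot(u[:, :3].T)      # reconstruct

# Nonzero relative error: the reconstruction is only an approximation of x.
print(np.linalg.norm(x - new_x) / np.linalg.norm(x))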

Third Question

When to use PCA depends on the problem domain. The point of dimensionality reduction is to get a new dataset that is not as hard to process, but that loses some information. So if your "result" / "time to process" trade-off is already good, I don't think you need to use it.

Other tips

First question: computing $\Sigma$

Actually, you perform the same computation, but using a matrix operation instead of a scalar operation. You might be misled by your notation of the feature matrix $X$, which you write as $x$, because in fact $$X = (x_j^{(i)})_{i,j} =\Big(x^{(1)} ... x^{(i)} ... x^{(n)}\Big)^T$$ that is to say, the $i$-th row of the matrix $X$ contains $x^{(i)}$, the $i$-th sample.

So, when you want to compute the empirical covariance matrix $\Sigma$ (which is indeed an $n \times n$ matrix), you have: $$\Sigma = \frac{1}{n} \sum_{i=1}^n x^{(i)} {x^{(i)}}^T = \frac{1}{n} X^T X$$ You can check that this is exactly the same computation.

In your code, you are actually computing $\Sigma$ directly using the matrix operation (i.e. $\Sigma = \frac{1}{n} X^T X$), and this does give you an $n \times n$ matrix. So both your implementation and the formula are correct.
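
As a quick numerical check (on made-up data, with the sample-index loop written out explicitly), the summation form and the matrix form give the same matrix:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))   # 50 samples, 4 features (synthetic)
m = X.shape[0]

# Summation form: each term x^(i) x^(i)^T is an outer product, i.e. a 4x4 matrix.
sigma_sum = np.zeros((4, 4))
for i in range(m):
    sigma_sum += np.outer(X[i], X[i])
sigma_sum /= m

# Matrix form, as in the question's code.
sigma_mat = X.T.dot(X) / m

print(np.allclose(sigma_sum, sigma_mat))   # True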

Second question: reconstructing $X$

Your new feature matrix $Z$ is an $n \times 3$ matrix according to your code. In fact, since your code does not show the original size of the feature space, let's consider the two cases:

  1. $X$ is also an $n \times 3$ matrix; then you do not perform any dimension reduction and you should have $X_{new} = X$ (at least in theory; in practice you might get a very small numerical approximation error, but it will basically be the same).
  2. $X$ is an $n \times d$ matrix with $d > 3$; then you do perform a dimension reduction, you get rid of some of the information contained in the original data, and so $X_{new} \neq X$.

In your case, I guess that you do have $d > 3$ and so you're in the second situation.
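
Both cases can be reproduced on synthetic data (the shapes below are chosen arbitrarily for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))          # synthetic data, d = 4 features
sigma = X.T.dot(X) / X.shape[0]
u, s, v = np.linalg.svd(sigma)

# Case 1: keep all d components -> reconstruction equals X up to rounding error.
Z = X.dot(u)
print(np.allclose(X, Z.dot(u.T)))          # True

# Case 2: keep only 3 of the 4 components -> reconstruction differs from X.
Z = X.dot(u[:, :3])
print(np.allclose(X, Z.dot(u[:, :3].T)))   # False: information was discarded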

Third question: when to apply PCA

Well, that depends, in fact...

First of all, PCA performs an SVD, which can become very expensive if you have a lot of features. In fact, there are several ways to compute PCA; your way is the closest to the mathematical theory, but in practice computing an SVD directly on $X$ avoids forming $X^T X$, which is expensive, and recovers the same space. But in any case, in some situations it can be expensive and time consuming, and thus not practical.
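
To make that concrete, here is a sketch on synthetic, centered data (shapes made up for illustration) showing that an SVD of $X$ itself recovers the same directions as the SVD of $X^T X / m$, without ever forming the covariance matrix:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))          # synthetic data
X = X - X.mean(axis=0)                     # centered
m = X.shape[0]

# Route 1 (as in the question): SVD of the covariance matrix.
sigma = X.T.dot(X) / m
u_cov, s_cov, _ = np.linalg.svd(sigma)

# Route 2: SVD of X directly; its right singular vectors span the same directions.
_, s_x, vt = np.linalg.svd(X, full_matrices=False)

# Same directions up to the sign of each column, and s_cov = s_x**2 / m.
print(np.allclose(np.abs(u_cov), np.abs(vt.T)))   # True
print(np.allclose(s_cov, s_x ** 2 / m))           # True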

As PCA sorts the new feature space according to the value of the eigenvalues, i.e. the variance along the new directions, removing the last directions might in some cases remove some noise, which can help. However, at the same time, it can throw away valuable discriminative information, which is why other methods such as LDA can be interesting.
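
As an aside (this heuristic is not part of the answer above, just a common rule of thumb), the eigenvalues themselves tell you how much variance each direction carries, which can guide how many directions to keep:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 10))         # synthetic data
X = X - X.mean(axis=0)

sigma = X.T.dot(X) / X.shape[0]
_, s, _ = np.linalg.svd(sigma)             # s holds the variances of the new directions

# Cumulative fraction of variance retained by the first k directions.
retained = np.cumsum(s) / np.sum(s)
k = int(np.searchsorted(retained, 0.99)) + 1   # smallest k retaining ~99% of the variance
print(retained)
print(k)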

So to summarize, the answer to the question is no, it is not a good choice to use it every time; it depends on your problem.

Licensed under: CC-BY-SA with attribution