Which variables combine to form most of the variance for a principal component in PCA?

https://stackoverflow.com/questions/19723993

02-07-2022

Question

I get how PCA works and how to implement it in Matlab, but I'm at a loss to find out which variables contribute most strongly to a principal component.

My question is: suppose I have a data set of variables A,B,C,D,E,F. Unknown to me, variables A,B,C,E measure almost the same thing, and variables D,F both measure a different thing. There is little correlation between the variables in set (A,B,C,E) and the variables in set (D,F).

PCA tells me that there are 2 main principal components, and I know how to compute them. What I do not know is how to identify that A,B,C,E and D,F form two groups of variables, each measuring the same thing within the group. Any advice on this would be greatly appreciated.

Solution

First let's create some data that behaves as you described: four variables that measure something similar, and two variables that measure something else.

>> x = randn(100, 1);
>> y = randn(100, 1);
>> v = [[x,x,x,x] + 0.1*randn(100,4), [y,y]  + 0.1*randn(100,2)];
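As a quick sanity check that the simulated data has the structure described, one could look at the correlation matrix of v before running PCA. This is a minimal sketch; the call to rng and the variable name R are my additions, not part of the original session.

```matlab
% Sketch: confirm the block structure of the simulated data.
rng(0);                          % assumption: fixed seed so the run is repeatable
x = randn(100, 1);
y = randn(100, 1);
v = [[x,x,x,x] + 0.1*randn(100,4), [y,y] + 0.1*randn(100,2)];

R = corrcoef(v);                 % 6-by-6 matrix of pairwise correlations
disp(R)
```

The upper-left 4-by-4 block and the lower-right 2-by-2 block should be close to 1, while the entries linking the two blocks should be much smaller in magnitude.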

Now find the principal components with a call to pca:

>> [coeff, scores, latent, tsq, explained] = pca(v);

By looking at the variable latent (the variance along each component) we can see that the first two principal components are dominant:

>> latent
latent =
    5.4821
    2.0491
    0.0120
    0.0106
    0.0089
    0.0073
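To put a number on "dominant": for the latent values printed above, the first two components account for (5.4821 + 2.0491) / 7.5700, or about 99.5%, of the total variance. The sketch below computes the same shares for freshly simulated data; the seed and the names share and cumShare are my own, and the exact figures will differ from run to run.

```matlab
% Sketch: percentage of variance carried by each component.
rng(0);                          % assumption: fixed seed, not in the original
x = randn(100, 1);
y = randn(100, 1);
v = [[x,x,x,x] + 0.1*randn(100,4), [y,y] + 0.1*randn(100,2)];

[~, ~, latent, ~, explained] = pca(v);

share    = 100 * latent / sum(latent);  % per-component share, in percent
cumShare = cumsum(share);               % running total
disp([share, cumShare])
```

The fifth output of pca, explained, already holds these percentages, so share is only spelled out here to show where the numbers come from.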

Now, by looking at the first two columns of coeff (each column of coeff contains the loadings of your six variables on one component) it is clear that variables 1-4 load heavily on the first component (in blue) and variables 5-6 load heavily on the second component (in red).

>> bar(coeff(:, 1:2))

[Figure: bar chart of the loadings of the six variables on the first two principal components]
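If you would rather read the groups off programmatically than from a plot, one simple heuristic is to assign each variable to the retained component on which its absolute loading is largest. This is only a sketch under the assumption that each group of variables lines up with a single component, as it does in this example; the names k and group are hypothetical.

```matlab
% Sketch: label each variable by its dominant component.
rng(0);                          % assumption: fixed seed, not in the original
x = randn(100, 1);
y = randn(100, 1);
v = [[x,x,x,x] + 0.1*randn(100,4), [y,y] + 0.1*randn(100,2)];

coeff = pca(v);                  % columns are components, rows are variables
k = 2;                           % number of components judged dominant from latent

% For every variable (row), find the column among the first k with the
% largest absolute loading. The sign of a loading is arbitrary, hence abs.
[~, group] = max(abs(coeff(:, 1:k)), [], 2);
disp(group')
```

Variables that receive the same label are the ones that move together. For data like the above, variables 1-4 should share one label and variables 5-6 the other. When the components mix several groups, this heuristic is less reliable and the loadings should be inspected directly.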

Licensed under: CC-BY-SA with attribution
