Question

Imagine I have the following matrix, which gives the grades of students in the subjects German, Philosophy, Math and Physics:

# grades of ten students in each of the four subjects
ger = c(2,4,1,3,2,4,4,1,2,3)
phi = c(3,4,1,2,2,3,3,2,2,2)
mat = c(1,3,2,4,1,2,2,4,3,1)
phy = c(2,2,2,5,2,2,3,4,3,3)
A = cbind(ger,phi,mat,phy)

I combine everything into a matrix and scale the data:

As = scale(A)

Now I perform a PCA and look at its summary:

summary(princomp(As), loadings = TRUE)

This returns the following output:

Importance of components:
                       Comp.1    Comp.2     Comp.3     Comp.4
Standard deviation     1.3257523 1.1657791 0.59600603 0.35793402
Proportion of Variance 0.4882275 0.3775114 0.09867311 0.03558799
Cumulative Proportion  0.4882275 0.8657389 0.96441201 1.00000000

Loadings:
     Comp.1 Comp.2 Comp.3 Comp.4
ger  0.496 -0.502  0.519  0.482
phi  0.548 -0.443 -0.423 -0.570
mat  -0.430 -0.572 -0.546  0.435
phy  -0.518 -0.474  0.503 -0.503

I have a few hints for the first component (based on the loadings):

  • There is a high positive correlation between German and Philosophy, and there is also a high positive correlation between Math and Physics.
  • Students who are good at languages (German and Philosophy) are often worse in MINT subjects (Math and Physics), and the other way around.

And I have an idea about the second one, which I cannot fully interpret:

  • It's a weighted arithmetic mean over all four variables.

But I have no idea how to interpret Comp. 2, Comp. 3 and Comp. 4 based on the loadings, especially because the values of Comp. 2 are all negative, i.e. they all have the same orientation. Can someone help me? Thanks in advance!

Solution

The columns of your loadings matrix are a basis of orthonormal eigenvectors. This is an important concept from linear algebra, and well worth learning about in detail if you're not familiar. But for the purposes of this answer it can be understood as defining a system of coordinates.
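
If you want to check this concretely in R, the loading columns really are orthonormal. Here is a minimal sketch (`pca` and `L` are just names I'm introducing for the fitted object and its loading matrix):

pca <- princomp(As)
L   <- unclass(loadings(pca))  # 4 x 4 matrix whose columns are the loading vectors

# Each column has unit length and the columns are mutually perpendicular,
# so t(L) %*% L should be (numerically) the 4 x 4 identity matrix.
round(crossprod(L), 10)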

For each student, we can define a point in a four-dimensional space (specifically, in $\mathbb{R}^4$) which represents their grades (after centering and normalization). Or to put it another way, you can imagine the set of all students' grades as a scatterplot in four dimensions, with four perpendicular axes. We can orient these axes in various directions (just as we can in two or three dimensions). The most obvious choice is to have one axis for each subject: the axis collinear with the unit vector from the origin to $(1,0,0,0)$ represents the grade in German, the axis collinear with $(0,1,0,0)$ the grade in Philosophy, the axis collinear with $(0,0,1,0)$ the grade in Math, and the axis collinear with $(0,0,0,1)$ the grade in Physics.
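
In R terms, each row of your scaled matrix is exactly such a point; for example (nothing new here, just the data from your question):

# the first student's coordinates in the original subject axes:
# one centred-and-scaled grade per perpendicular axis
As[1, ]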

However, there's no reason to expect that the direction in which our scatterplot is most spread out (the direction of greatest variance in the data) will align with one of these axes. PCA picks out a new set of axes so that one axis aligns with the direction of greatest variance, and another aligns with the direction of the greatest remaining variance after the first direction is projected out, and so forth. The unit vectors (expressed in the original coordinate system) which point along these new axes are the columns in your loadings matrix.
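
If it helps, you can confirm this numerically: projecting the (already centred) data onto the loading vectors reproduces the component scores that princomp reports. A quick sketch (redefining `pca` and `L` for completeness):

pca <- princomp(As)
L   <- unclass(loadings(pca))

# each student's grade vector, re-expressed in the new (principal component) axes;
# this should equal pca$scores up to numerical rounding
all.equal(As %*% L, pca$scores, check.attributes = FALSE)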

In this particular example, the loading vector for the first principal component lies along an axis that basically expresses whether a student is better at Math and Physics or better at German and Philosophy. The loading vector for the second principal component lies along an axis that basically expresses how good or bad a student they are overall (hence all the components of the vector have the same sign and similar magnitude). You wondered about the negative sign on all four components: if you're familiar with eigenvectors, you'll know that flipping the sign of every component of the vector is irrelevant. Basically, it's the same as swapping which end of the axis we call positive and which we call negative.
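
One way to convince yourself that the overall sign carries no information is to compare with prcomp(), which computes the same axes by a different route and is free to pick the opposite sign for any of them; a sketch, again with `L` as above:

L <- unclass(loadings(princomp(As)))  # loading vectors from princomp, as before
P <- prcomp(As)                       # same principal axes, possibly different sign conventions

# the two sets of loading vectors agree column by column, up to an overall sign,
# so this difference should be a matrix of (numerical) zeros
round(abs(P$rotation) - abs(L), 10)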

So in this case the first two loading vectors are fairly close to what many of us might have expected to see. But even in this fairly intuitive example, you shouldn't be surprised that the loading vectors for the later principal components don't seem as obvious. That's because they only address the variance that remains after we project out the variance explained by the first two components. We all probably know that students who are good at Physics tend to be good at Math, but how many of us know (for example) whether, after controlling for how good they are at Physics, the students who are better at Philosophy than at German also tend to be better at Math? These subtler effects will be far less obvious to a casual observer than the dominant ones.
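
You can also see how little variance is left for these later components: reconstruct the scaled grades from the first two components only and measure what remains. In your output the first two components explain about 87% of the variance, so the residual share should come out around 13% (a sketch, with `pca` and `L` as before):

pca <- princomp(As)
L   <- unclass(loadings(pca))

# approximation of the scaled grades using only the first two components
recon <- pca$scores[, 1:2] %*% t(L[, 1:2])
resid <- As - recon

# share of the total variance left over for components 3 and 4;
# this should match 1 minus the cumulative proportion after Comp.2 (~0.134)
sum(apply(resid, 2, var)) / sum(apply(As, 2, var))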

Once you get to the loading vector for the fourth principal component (out of four), you really don't need to wonder at all about why it has the particular value that it has. In fact, this vector is entirely determined by the previous three (up to the irrelevant overall sign). This can be understood by remembering that PCA picked out four perpendicular axes in a four-dimensional space: once the first three are specified, there's only one remaining possible choice that's perpendicular to all of them.
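
If you want to verify that, you can compute the one direction orthogonal to the first three loading vectors and compare it with the fourth. A sketch using MASS::Null(), which returns an orthonormal basis for the orthogonal complement (assuming the MASS package is installed):

library(MASS)

L <- unclass(loadings(princomp(As)))

# the orthogonal complement of the first three loading vectors is a single direction,
# and it should coincide with the fourth loading vector up to an overall sign
v4 <- Null(L[, 1:3])
round(cbind(complement = drop(v4), Comp.4 = L[, 4]), 3)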

Licensed under: CC-BY-SA with attribution