Identifying & removing outliers from PCA & QQ plots

Question 1

For some reason, I don't believe that the identify method is supported in the car package (the source of qqPlot())

Let's take a look at a PCA of the USArrests data...

pca <- prcomp(USArrests)

The plot of this using qqPlot is easy enough.

require(car)
qqPlot(pca$x[,2],pch = 20, col = c(rep("red", 33), rep("blue", 99)))

However, qqPlot() does not allow for point selection via identify().

identify(qqPlot(pca$x[,2],pch = 20, col = c(rep("red", 33), rep("blue", 99))))
# numeric(0)

You can, however, make use of qqnorm() in the stats package.

identify(qqnorm(pca$x[,2],pch = 20, col = c(rep("red", 33), rep("blue", 99))))

This will produce a less sophisticated graph, but you should be able to add a line and confidence intervals manually via qqline() (also in stats) and a little more math.

Question 2

You can try the identify method in R. Typically, run

identify(qqPlot(pca$x[,2],pch = 20, col = c(rep("red", 33), rep("blue", 99))))

and left-click on the points which you want to identify. The index of the points in the score vector should be the same as in the original data.

Question 3

You can also visualize influence using the fviz_pca_ind() function in the factoextra library, as follows:

require(factoextra)
pca = prcomp(mydata)
fviz_pca_ind(pca,
         col.ind = "contrib", # Color by contribution
         gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07") #assign gradient
         )

This automatically labels the individuals, and colours them by their influence.