Question

I tried this question in stats.stackexchange and somebody suggested I try it over here, so here goes:

I've completed PCA analysis, in R with VEGAN package, of some ecological data on tree health. There are 80 trees total (so, 80 'sites') divided into four treatment categories. I've got the data plotted with color coded points--colors according to the treatment groups. Rather than plotting individual sites/trees on PCA biplot, I'd like to make something like a box-and-whisker plot that has four 'crosses' that show the centroid for each group and the SE in both PCA dimensions. I've seen figures like this in papers, but I can't seem to find an R script for plotting this way. Any suggestions? (I'd like to post an example image here of what I'm looking for, but the ones I can find are all paywalled, sorry).

I guess an alternative would be to just take the site scores and manually find the means and SE's and create my own plot, but I'd rather find a script for it, if possible.

The code I've been running is really straightforward:

p1<-princomp(scale(health, scale=T))
summary(p1)
scores(p1)
plot(p1)
loadings(p1)
biplot(p1, xlab = "PC 1 (38%)", ylab = "PC 2 (22%)",cex=0.6)
plot(p1$scores[,1],p1$scores[,2])
names(p1)

plot(p1$scores[,1],p1$scores[,2], type='n', xlab="PC I", ylab="PC II")
text(p1$scores[,1],p1$scores[,2] labels=Can$tree)
Was it helpful?

Solution

I would probably start with ordiellipse and see if that suits your needs.

### Reproducible example
require("vegan")
data(varespec)
data(varechem)

pca <- rda(varespec, scale = TRUE)
grp <- with(varechem, cut(Baresoil, 4, labels = 1:4))
cols <- c("red","orange","blue","forestgreen")
scl <- 1 ## scaling

plot(pca, display = "sites", scaling = scl, type = "n")
points(pca, display = "sites", scaling = scl, col = cols[grp], pch = 16)
lev <- levels(grp)
for (i in seq_along(lev)) { ## draw ellipse per group
  ordiellipse(pca, display = "sites", kind = "se", scaling = scl,
              groups = grp, col = cols[i], show.groups = lev[i])
}
## centroids
scrs <- as.data.frame(scores(pca, display = "sites", scaling = scl, 
                             choices = 1:2))
cent <- do.call(rbind, lapply(split(scrs, grp), colMeans))
points(cent, col = cols, pch = 3, cex = 1.1)

This produces

enter image description here

You can remove the points() line from the code above to stop it drawing the actual samples, but I thought it instructive here in terms of understanding what ordiellipse is doing.

In the plot, the centroid is marked as the mean of the site scores on each axis by grouping grp. The ellipse is a continuous region that is (given the setting I chose in the ordiellipse() call) 1 standard error about this centroid. Your suggestion for error bars in each direction is a specific case of the ellipse drawn by ordiellipse --- if you were to compute the standard errors of the centroids, they should extend as far as the extremal points of the ellipse in the horizontal and vertical directions.

However, this would fail to take into account the covariances of the scores on the two axes. Note in the example below how in those ellipses that are oriented at angles to the axes, the standard error bars do not intersect with the ellipse at their extremal points. If you were to draw a box containing the the region defined by the error bars, it would contain the ellipse, but it gives a very different impression of the uncertainty in the centroid.

serrFun <- function(df) {
  apply(df, 2, function(x) sd(x) / sqrt(length(x)))
}
serr <- do.call(rbind, lapply(split(scrs, grp), serrFun))
for (i in seq_along(lev)) {
    arrows(cent[i, 1] - serr[i, 1], cent[i, 2],
           cent[i, 1] + serr[i, 1], cent[i, 2],
           col = cols[i], code = 3, angle = 90, length = 0.05)
    arrows(cent[i, 1], cent[i, 2] - serr[i, 2],
           cent[i, 1], cent[i, 2] + serr[i, 2], 
           col = cols[i], code = 3, angle = 90, length = 0.05)
}

enter image description here

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top