Add sample names to PCA plotted with s.class

https://stackoverflow.com/questions/20584587

01-09-2022
|

Question

I'm trying to build a pca plot with the s.class function of ade4 package. I have a data set containing the abundance of bacterial species (rows) in different samples (columns). I need to perform some statistical tests and obtain some clusters of my samples, to represent them in a PCA. I ran this, following the scripts published on a paper:

>data=read.table("L4.txt",header=T,row.names=1,dec=".",sep="\t")

>data=data[-1,]

>library(cluster)

> JSD <- function(x,y) sqrt(0.5 * KLD(x, (x+y)/2) + 0.5 * KLD(y, (x+y)/2))

> KLD <- function(x,y) sum(x * log(x/y))

> dist.JSD <- function(inMatrix, pseudocount=0.000001, ...) {
  KLD <- function(x,y) sum(x *log(x/y))
  JSD<- function(x,y) sqrt(0.5 * KLD(x, (x+y)/2) + 0.5 * KLD(y, (x+y)/2))
  matrixColSize <- length(colnames(inMatrix))
  matrixRowSize <- length(rownames(inMatrix))
  colnames <- colnames(inMatrix)
  resultsMatrix <- matrix(0, matrixColSize, matrixColSize)

  inMatrix = apply(inMatrix,1:2,function(x) ifelse (x==0,pseudocount,x))

   for(i in 1:matrixColSize) {
   for(j in 1:matrixColSize) { 
   resultsMatrix[i,j]=JSD(as.vector(inMatrix[,i]),
   as.vector(inMatrix[,j]))
  }
   }
colnames -> colnames(resultsMatrix) -> rownames(resultsMatrix)
as.dist(resultsMatrix)->resultsMatrix
attr(resultsMatrix, "method") <- "dist"
return(resultsMatrix) 
}

>data.dist=dist.JSD(data)

>pam.clustering=function(x,k) { # x is a distance matrix and k the number of clusters
                      require(cluster)
                      cluster = as.vector(pam(as.dist(x), k, diss=TRUE)$clustering)
                      return(cluster)
                     }

>data.cluster=pam.clustering(data.dist,k=3)

>library(ade4)

>obs.pca=dudi.pca(data.frame(t(data)), scannf=F, nf=10)

>obs.bet=bca(obs.pca, fac=as.factor(data.cluster), scannf=F, nf=k-1)

>s.class(obs.bet$ls, fac=as.factor(data.cluster), grid=F)

My data looks like:

                       06.TO.VG   21.TO.V   02.TO.VG   41.TO.VG    30.TO.V
Actinomycetaceae     0.000000000 0.00000000 0.000000000 0.000000000 0.000000000
Bifidobacteriaceae   0.019108280 0.00000000 0.000000000 0.000787092 0.000000000
Coriobacteriaceae    0.060006705 0.01490985 0.002632445 0.003148367 0.008313539
Propionibacteriaceae 0.000000000 0.00000000 0.000000000 0.000000000 0.000000000
Bacteroidetes        0.000000000 0.00000000 0.000000000 0.000000000 0.000000000
Bacteroidia          0.157224271 0.02288488 0.010529780 0.002361276 0.005938242
Bacteroidaceae       0.072745558 0.01178918 0.028956894 0.059031877 0.097387173
Porphyromonadaceae   0.004022796 0.01005548 0.147745969 0.026367572 0.038004751
Prevotellaceae       0.083808247 0.30374480 0.586048042 0.487603306 0.290973872
Rikenellaceae        0.025477707 0.07836338 0.011187891 0.003935458 0.004750594

I had the plot I was looking for, but there aren't the sample names, so I don't know which samples are clustereing together. I also tried to build a vector containing the column names of my data (sampleID) and then pass it to the label option in s.class

> s.class(obs.bet$ls, fac=as.factor(data.cluster), grid=F, label=labs)

but I still haven't them....

Is there a way to get this with s.class function?

Thanks Francesca

Solution

There is no way to (directly) label the individual points using s.class(...). As you point out, the label= argument labels the clusters (point classes), not the individual points.

However, since s.class(...) uses the plot routines in base R, you can simply add a call to text(...):

labs <- rownames(obs.bet$ls)
s.class(obs.bet$ls,fac=factor(data.cluster),grid=F, xlim=c(-4,4))
text(obs.bet$ls,labels=labs,adj=c(-.1,-.8),cex=0.8)

There's a fair amount of tweaking of xlim=, adj=, and cex= to make the labels readable, but it does work.

Here are two alternatives that may be helpful (hopefully).

clusplot(obs.bet$ls,data.cluster,labels=2)

Produces this, which is very similar to your plot, with the points labeled.

Another alternative uses ggplot:

library(ggplot2)
gg <- cbind(obs.bet$ls,cluster=data.cluster)
gg <- cbind(sample=rownames(gg),gg)
ggplot(gg, aes(x=CS1, y=CS2)) + 
  geom_point(aes(color=factor(cluster)),size=5) + 
  geom_text(aes(label=sample),hjust=-0.2) + 
  geom_hline(yintercept=0,linetype=2) + 
  geom_vline(xintercept=0,linetype=2) +
  scale_color_discrete(name="Cluster") +
  xlim(-4,4)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow