Why do cluster plot labels use rows instead of names from ID column?
12-12-2019
Question
I am working with a data set (column 1=gene names and column 2 = expression values) and I'm trying to do a cluster plot but what I find is that the branches are labeled by row number rather than the gene ID from column 1.
dataset: https://dl.dropbox.com/u/364456/miRNA.csv
Using:
attach(animals)
d=dist(as.matrix(animals))
hc=hclust(d)
plot(hc)
resulting plot:
I've tried to do kmeans clustering and end up getting this error:
NAs introduced by coercion.
Which indicates to me that I have not formatted my data file correctly.
Anyone know what's going on here?
Solution
For hclust to use your gene names as the labels, that column has to become the row names.
Problem: gene mmu-miR-191 appears twice, and row names cannot be repeated. Since the values in both rows are the same, I'm going to assume it is a duplicate and erase the second one.
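mirna <- read.table("miRNA.csv", sep=",", header=TRUE) # No row.names=1 here: the duplicated name would make it fail
mirna <- mirna[-34,] # Delete the redundant row.
row.names(mirna) <- mirna[,1] # Declare column 1 as the row names
mirna <- mirna[,-1,drop=FALSE] # Drop the name column so only numeric values remain
d <- dist(as.matrix(mirna)) # And then your routine
hc <- hclust(d)
plot(hc)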
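If you would rather not count rows by hand, the same steps can be sketched on a small stand-in data frame. The values and the column names gene and expression below are invented for illustration, not taken from miRNA.csv. The sketch also shows why the name column should stay out of the matrix: as.matrix() on a data frame that still contains a text column gives a character matrix, which is the likely source of the "NAs introduced by coercion" message.

```r
# Toy stand-in for miRNA.csv: names and values are invented for illustration only.
toy <- data.frame(
  gene = c("mmu-miR-16", "mmu-miR-191", "mmu-miR-21", "mmu-miR-191"),
  expression = c(2.5, 7.1, 4.0, 7.1),
  stringsAsFactors = FALSE
)

# With the text column still present, the matrix is character, not numeric.
is.character(as.matrix(toy))

# Locate repeated gene names instead of hard-coding a row index.
dup <- duplicated(toy$gene)   # TRUE for the second and later occurrences
toy <- toy[!dup, ]

# Build a numeric matrix and attach the gene names as its row names.
expr <- as.matrix(toy["expression"])
rownames(expr) <- toy$gene

# dist() carries the row names through, and hclust() stores them as labels.
hc <- hclust(dist(expr))
hc$labels
```

Dropping duplicates this way is only safe when the repeated rows really are identical, as assumed above; otherwise you would need to decide how to combine them first.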
OTHER TIPS
By default, the row numbers or row names are used to label the observations. However, you can use the labels argument to select a variable to use for the labels.
plot(modelname, labels=dataset$variable)
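As a minimal sketch of that approach, assuming a data frame with a name column and one numeric column (the data and the names toy, gene and expression are made up for the example):

```r
# Toy data with invented values; `gene` and `expression` are placeholder names.
toy <- data.frame(
  gene = c("geneA", "geneB", "geneC", "geneD"),
  expression = c(1.2, 3.4, 3.9, 8.0),
  stringsAsFactors = FALSE
)

# Cluster on the numeric column only; no row names are set here.
hc <- hclust(dist(toy$expression))

# Supply the labels at plot time; they must be in the same order as the rows.
plot(hc, labels = toy$gene)
```

This leaves the data frame untouched, so it also works when the names are not unique and therefore cannot be used as row names.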