Question

I am working with a data set (column 1 = gene names, column 2 = expression values) and I'm trying to make a cluster plot, but the branches are labeled by row number rather than by the gene ID from column 1.

dataset: https://dl.dropbox.com/u/364456/miRNA.csv

Using:

attach(animals)
d=dist(as.matrix(animals))
hc=hclust(d)
plot(hc)

Resulting plot: (image not available; the branches are labeled with row numbers)

I've tried to do kmeans clustering and end up getting this error:

NAs introduced by coercion.

Which indicates to me that I have not formatted my data file correctly.

Anyone know what's going on here?


Solution

For hclust to label the branches with your gene names, that column has to be set as the row names of the data, and the matrix passed to dist should contain only the numeric expression values.

Problem: the gene mmu-miR-191 appears twice, and row names cannot be repeated. Since the value is the same in both rows, I'll assume it is a duplicate and delete the second one.

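# Read without row.names=1: the duplicated gene name would make that fail
mirna <- read.table("miRNA.csv", sep=",", header=TRUE)
mirna <- mirna[-34, ]                 # Delete the redundant row
row.names(mirna) <- mirna[, 1]        # Declare column 1 as the row names
mirna <- mirna[, -1, drop=FALSE]      # Drop the name column so only numbers remain
d <- dist(as.matrix(mirna))           # And then your routine
hc <- hclust(d)
plot(hc)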
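
Rather than hard-coding the row index, the duplicate can also be found programmatically with duplicated(). The following is only a minimal sketch on a made-up data frame: the column names gene and expression and all values are assumptions for illustration, not the headers or contents of the real miRNA.csv.

```r
# Toy stand-in for miRNA.csv; column names and values are hypothetical
mirna <- data.frame(
  gene = c("mmu-miR-1", "mmu-miR-191", "mmu-miR-21", "mmu-miR-191"),
  expression = c(2.5, 7.1, 4.0, 7.1),
  stringsAsFactors = FALSE
)

dup <- duplicated(mirna$gene)     # TRUE for the second and later copies of a name
mirna <- mirna[!dup, ]            # keep the first occurrence of each gene
row.names(mirna) <- mirna$gene    # gene IDs become the row names
mirna$gene <- NULL                # leave only the numeric column

hc <- hclust(dist(as.matrix(mirna)))
plot(hc)                          # leaves are now labeled with the gene IDs
```

Because the name column is removed before clustering, the data passed on is purely numeric, which should also avoid the "NAs introduced by coercion" warning when the same data frame is handed to kmeans.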


OTHER TIPS

By default, the row numbers or row names are used to label the observations. However, you can use the labels argument to select a variable to use for the labels.

plot(modelname, labels=dataset$variable)
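
As a concrete sketch of the labels approach, assuming a small made-up data frame (the columns gene and expression and their values are hypothetical, chosen only for illustration):

```r
# Toy data; names and values are hypothetical
dataset <- data.frame(
  gene = c("geneA", "geneB", "geneC"),
  expression = c(1.0, 1.2, 5.0),
  stringsAsFactors = FALSE
)

# Cluster on the numeric column only, so no character data reaches dist()
modelname <- hclust(dist(dataset$expression))

# Supply the gene names as leaf labels at plot time
plot(modelname, labels = dataset$gene)
```

Since the labels here are just a character vector passed to plot, they do not have to be unique, so this route works even when a gene name is repeated.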
Licensed under: CC-BY-SA with attribution