In your case the data has a long tail, which is expected for gene expression data (approximately lognormal).
data <- read.table(file='http://pastebin.com/raw.php?i=ZaGkPTGm',
header=TRUE, row.names=1)
mat <- as.matrix(data[,-1]) # -1 removes the first column containing gene symbols
As you can see from the quantiles, three quarters of the values lie below about 1.5, while the genes with the highest expression stretch the range to above 300.
quantile(mat)
# 0% 25% 50% 75% 100%
# 0.000 0.769 1.079 1.544 346.230
When hierarchical clustering is performed on unscaled data, the resulting dendrogram may be biased towards the genes with the highest expression values, as seen in your example. This merits either a logarithmic or a z-score transformation, among many others (reference). Your dataset contains values == 0, which is a problem for log-transformation since log(0) is undefined.
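If you would still rather use a log transformation, one common workaround is to add a small pseudocount before taking the log. The snippet below is only a sketch on a made-up toy matrix; the pseudocount of 1 is an assumption, not something derived from your data, so pick a value that suits your measurement scale.

```r
# Toy matrix (hypothetical values) with a zero and a long right tail
toy <- matrix(c(0, 0.8, 1.1,
                1.5, 346.2, 2.0), nrow = 2, byrow = TRUE)

# In R, log2(0) evaluates to -Inf, which breaks distance calculations
log2(toy)[1, 1]   # -Inf

# Adding a pseudocount of 1 (an assumption) keeps every value finite
# and compresses the long tail
logged <- log2(toy + 1)
range(logged)
```

With a pseudocount of 1, zeros map back to zero, which is convenient, but it also distorts small values more than large ones, so the z-score route described next may still be preferable.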
Z-score transformation (reference) is implemented within heatmap.2, but it's important to note that the function computes the distance matrix and runs the clustering algorithm before scaling the data. Hence the option scale='row' doesn't influence the clustering results; see my earlier post (differences in heatmap/clustering defaults in R) for more details.
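To see why the order of scaling and clustering matters, here is a minimal sketch on three made-up genes (the names and values are hypothetical). geneA and geneB share the same expression pattern at very different magnitudes, while geneC has the opposite pattern at a magnitude similar to geneA.

```r
toy <- rbind(
  geneA = c(1, 2, 3, 4),
  geneB = c(100, 200, 300, 400),  # same pattern as geneA, higher magnitude
  geneC = c(4, 3, 2, 1)           # opposite pattern, magnitude close to geneA
)

# Distances on the raw values are dominated by magnitude
d_raw <- as.matrix(dist(toy))

# Distances on row-wise z-scores reflect the expression pattern instead
z_toy <- t(scale(t(toy)))
d_z <- as.matrix(dist(z_toy))

d_raw["geneA", c("geneB", "geneC")]
d_z["geneA", c("geneB", "geneC")]
```

On the raw values geneA sits closest to geneC, whereas after row scaling it sits closest to geneB. Clustering that happens before the scaling therefore groups genes by magnitude rather than by pattern.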
I would propose that you scale your data before running heatmap.2:
# scale function transforms columns by default hence the need for transposition.
z <- t(scale(t(mat)))
quantile(z)
# 0% 25% 50% 75% 100%
# -2.1843994 -0.6646909 -0.2239677 0.3440102 2.2640027
# set custom distance and clustering functions
hclustfunc <- function(x) hclust(x, method="complete")
distfunc <- function(x) dist(x,method="maximum")
# obtain the clusters
fit <- hclustfunc(distfunc(z))
clusters <- cutree(fit, 5)
library(gplots)  # provides heatmap.2 and greenred
pdf(file='heatmap.pdf', height=50, width=10)
# full argument names (hclustfun, symbreaks) instead of relying on partial matching
heatmap.2(z, trace='none', dendrogram='row', Colv=FALSE, scale='none',
          hclustfun=hclustfunc, distfun=distfunc, col=greenred(256), symbreaks=TRUE,
          margins=c(10,20), keysize=0.5, labRow=data$Gene.symbol,
          lwid=c(1,0.05,1), lhei=c(0.03,1), lmat=rbind(c(5,0,4),c(3,1,2)),
          RowSideColors=as.character(clusters))
dev.off()
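Once the tree has been cut with cutree, you may want to check which genes ended up in which cluster. The following is a small self-contained sketch on random toy data (the matrix, gene names and k = 3 are all made up for illustration); the same two calls at the end should work on the clusters vector from the code above.

```r
set.seed(1)
# Hypothetical 10 x 4 matrix standing in for the scaled expression data
toy <- matrix(rnorm(40), nrow = 10,
              dimnames = list(paste0("gene", 1:10), NULL))

# Same distance and linkage choices as above
fit_toy <- hclust(dist(toy, method = "maximum"), method = "complete")
clusters_toy <- cutree(fit_toy, k = 3)

# Number of genes per cluster
table(clusters_toy)

# Gene names grouped by cluster
split(names(clusters_toy), clusters_toy)
```

cutree returns a named integer vector, so the names can be used directly to pull the corresponding rows out of the matrix.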
Also, see the additional posts here and here, which explain how to set the layout of the heatmap via the lmat, lwid and lhei parameters.
The resulting heatmap is shown below (row and column labels are omitted):