Question

I'm using R to perform a hierarchical clustering. As a first approach, I used hclust and performed the following steps:

  1. I imported the distance matrix
  2. I used the as.dist function to transform it into a dist object
  3. I ran hclust on the dist object

Here's the R code:

distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
hclust(d, method = "ward.D")  # "ward" was renamed "ward.D" in R 3.1.0

At this point I would like to do something similar with the function pvclust; however, I cannot because it's not possible to pass a precomputed dist object. How can I proceed considering that I'm using a distance not available among those provided by the dist function of R?


Solution

It's not clear to me whether you only have a distance matrix, or whether you computed it yourself beforehand. In the former case, as already suggested by @Vincent, it would not be too difficult to tweak the R code of pvclust itself (using fix() or similar; I provided some hints on another question on CrossValidated). In the latter case, the authors of pvclust provide an example of how to use a custom distance function, although that means you will have to install their "unofficial version".
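For the second case, more recent pvclust releases (2.0 and later, as far as I know) let method.dist be a function, so the unofficial version may no longer be needed. Below is a minimal sketch under that assumption. The cosine_dist function and the toy data are illustrative only; substitute your own distance, which must take the data matrix as its single argument and return a dist object over the columns.

```r
# Hypothetical custom dissimilarity: 1 - cosine similarity between columns.
# It takes one argument (the data matrix) and returns a "dist" object.
cosine_dist <- function(x) {
  x <- as.matrix(x)
  cross <- crossprod(x)            # t(x) %*% x: inner products between columns
  norms <- sqrt(diag(cross))       # Euclidean norm of each column
  res <- as.dist(1 - cross / outer(norms, norms))
  attr(res, "method") <- "cosine"
  res
}

# Illustrative data: 50 features (rows) measured on 6 objects (columns)
set.seed(42)
toy <- matrix(rnorm(50 * 6), ncol = 6,
              dimnames = list(NULL, paste0("obj", 1:6)))
d_toy <- cosine_dist(toy)

# Only attempted when a pvclust version assumed to accept a function is installed;
# nboot is kept small so the sketch runs quickly
if (requireNamespace("pvclust", quietly = TRUE) &&
    packageVersion("pvclust") >= "2.0.0") {
  fit <- pvclust::pvclust(toy, method.hclust = "average",
                          method.dist = cosine_dist, nboot = 100)
}
```

Note that this only helps when you still have the raw data: pvclust recomputes the distances on every bootstrap resample, which is why it cannot take a precomputed dist object.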

OTHER TIPS

I've tested Vincent's suggestion, and you can do the following (my data set is a dissimilarity matrix):

# Import your data
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)

# Compute the eigenvalues (with eig = TRUE, all of them are returned whatever k is)
x <- cmdscale(d, k = 1, eig = TRUE)

# Plot the eigenvalues and choose the number of dimensions:
# keep the leading ones and drop those close to 0
plot(x$eig, 
   type="h", lwd=5, las=1, 
   xlab="Number of dimensions", 
   ylab="Eigenvalues")

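# Recover coordinates that reproduce the distance matrix with the chosen number of dimensions
# (nb_dimensions is the value you read off the plot above)
x <- cmdscale(d, k = nb_dimensions)

# As mentioned by Stéphane, pvclust() clusters columns, hence the transpose.
# Its default method.dist is "correlation", so ask for Euclidean distances
# to work with the distances that cmdscale() reproduces.
library(pvclust)
pvclust(t(x), method.dist = "euclidean")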
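The number of dimensions can also be picked programmatically instead of by eye. Here is a minimal sketch, assuming a relative tolerance (tol, an arbitrary choice and not part of the original answer) below which an eigenvalue is treated as zero. The toy points stand in for distMatrix.csv.

```r
# Illustrative data: 20 points that truly live in 3 dimensions
set.seed(1)
pts <- matrix(rnorm(20 * 3), ncol = 3)
d <- dist(pts)

# eig = TRUE returns all eigenvalues, whatever k is
fit <- cmdscale(d, k = 1, eig = TRUE)

# Hypothetical rule: keep eigenvalues above a tiny fraction of the largest one
tol <- 1e-8 * max(fit$eig)
nb_dimensions <- sum(fit$eig > tol)

# Coordinates in nb_dimensions dimensions, and how well they reproduce d
x <- cmdscale(d, k = nb_dimensions)
max_err <- max(abs(dist(x) - d))
```

With real, noisy dissimilarities the eigenvalues rarely drop cleanly to zero, so the threshold is a judgment call and the plot remains a useful sanity check.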

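If the dataset is not too large, you can embed your n points in a space of dimension n-1 with the same distance matrix (the match is exact only when the distances are Euclidean, i.e. when cmdscale finds no negative eigenvalues).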

# Sample distance matrix
n <- 100
k <- 1000
d <- dist(matrix(rnorm(k * n), ncol = k), method = "manhattan")

# Recover some coordinates that give the same distance matrix
x <- cmdscale(d, n-1)
stopifnot( sum(abs(dist(x) - d)) < 1e-6 )

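# You can then use x or d interchangeably
r1 <- hclust(d)
r2 <- hclust(dist(x)) # same as r1, up to numerical precision
library(pvclust)
# pvclust() clusters columns and defaults to correlation distance and average linkage,
# so transpose x and set both methods to match hclust(d)
r3 <- pvclust(t(x), method.hclust = "complete", method.dist = "euclidean")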

If the dataset is large, you may have to check how pvclust is implemented.
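Before relying on the embedding, it is worth checking that the distances can actually be reproduced by Euclidean coordinates. This matters for a custom distance such as the one in the question. A minimal sketch of that check follows; the four-point example and the tolerance are illustrative assumptions.

```r
# Shortest-path distances between four points on a cycle: a valid metric
# that cannot be reproduced exactly by Euclidean coordinates
d_cycle <- as.dist(matrix(c(0, 1, 2, 1,
                            1, 0, 1, 2,
                            2, 1, 0, 1,
                            1, 2, 1, 0), nrow = 4))

# All eigenvalues of the doubly centred matrix used by classical MDS
eig <- cmdscale(d_cycle, k = 2, eig = TRUE)$eig

# A clearly negative eigenvalue means no exact Euclidean embedding exists
has_negative <- any(eig < -1e-8 * max(eig))
```

When has_negative is TRUE, the coordinates from cmdscale only approximate the original distances, so clustering them is no longer equivalent to clustering d.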

Licensed under: CC-BY-SA with attribution