R - check consistency of group assignment, group labels with different names

https://stackoverflow.com/questions/5323847

25-10-2019

Question

I am trying to assign sub-group membership in 4 independent cancer gene expression datasets, training on each dataset in turn, followed by testing the (metagene based) assignment in the remaining three, plus testing on the training cohort itself.

This produces a group membership for each sample in each comparison, which gives an idea of sample stability (does a given sample fall within the same cluster each time?). The problem is that the group labels can differ from comparison to comparison, so comparing the labels directly doesn't work.

In order to assess sample stability, I think I will need, for each sample, to catalogue its fellow subgroup members, but I haven't been able to conceptualise how precisely I should do this.

For what it's worth, the code below should demonstrate the problem a little more clearly than I have described above.

Thanks for reading, and any help is appreciated!

## Here we have 12 samples (A-L), all of which have congruent assignments, except sample K.
## From the two group assignments, we can see that group 1 has become group 4 in class2,
## group 2 has become group 1 etc. etc.

## How do we assess cluster membership with these differing subgroup labels?

class1<-c(1,2,3,4,1,2,3,4,1,2,3,4)
class2<-c(4,1,2,3,4,1,2,3,4,1,3,3)

names(class1)<-LETTERS[1:12]
names(class2)<-LETTERS[1:12]

Solution

Try matchClasses() in the e1071 package; some of the methods in the seriation package might also help. You need the full two-way table of classifications, though, e.g. table(class1, class2).
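As a rough illustration of the two-way table idea, here is a minimal base-R sketch using the question's class1 and class2 vectors. It matches each class1 label to the class2 label it shares the most samples with (a simple row-maximum rule), relabels class1 accordingly, and flags the samples that still disagree. The variable names (tab, best, mapping, class1_relabelled, inconsistent) are made up for this example, and the row-maximum rule is an assumption that can map two groups onto the same label when the clusterings differ a lot. The optional matchClasses() call at the end only runs if e1071 is installed.

```r
# Data from the question
class1 <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
class2 <- c(4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 3, 3)
names(class1) <- LETTERS[1:12]
names(class2) <- LETTERS[1:12]

# Full two-way table: rows are class1 labels, columns are class2 labels
tab <- table(class1, class2)
print(tab)

# For each class1 label, find the class2 label sharing the most samples
# (row-maximum rule; an assumption, not necessarily a one-to-one matching)
best <- apply(tab, 1, which.max)
mapping <- setNames(colnames(tab)[best], rownames(tab))

# Translate class1 into class2's labelling and compare sample by sample
class1_relabelled <- as.integer(mapping[as.character(class1)])
names(class1_relabelled) <- names(class1)
inconsistent <- names(class1)[class1_relabelled != class2]
print(inconsistent)

# Optional: let e1071 do the matching on the same table, if it is installed
if (requireNamespace("e1071", quietly = TRUE)) {
  print(e1071::matchClasses(tab, method = "rowmax"))
}
```

With the question's data, the only sample flagged as inconsistent should be K.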

OTHER TIPS

Nice question. Thank you for framing the question so clearly. I am working on clustering myself at the moment, and parked this question for solving later.

Here is a graphical way of solving the problem.

library(ggplot2)
# Create dummy data
# In the first instance, the labels are a perfect permutation of each other
# (A -> D, B -> A, C -> B, D -> C)
d <- data.frame(
    clust1 = LETTERS[rep(1:4, 3)],
    clust2 = LETTERS[rep(c(4,1,2,3), 3)]
)
# after_stat(n) replaces the deprecated ..n.. notation (needs ggplot2 >= 3.3.0)
ggplot(d, aes(x=clust1, y=clust2)) + geom_point(stat="sum", aes(size=after_stat(n)))

Perfect transposition - all bubbles same size

# Now modify data so that there is a single instance of imperfect matching
d$clust2[1] <- "A"
# after_stat(n) replaces the deprecated ..n.. notation (needs ggplot2 >= 3.3.0)
ggplot(d, aes(x=clust1, y=clust2)) + geom_point(stat="sum", aes(size=after_stat(n)))

Imperfect transposition - bubbles different sizes
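The idea raised in the question, cataloguing each sample's fellow subgroup members, can also be sketched without matching labels at all. The base-R example below builds a co-membership matrix for each assignment (TRUE where two samples share a group) and scores each sample by the fraction of other samples whose together/apart status is the same in both assignments. The names co1, co2 and agree, and the choice of score, are assumptions made for illustration rather than an established stability measure.

```r
# Data from the question
class1 <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
class2 <- c(4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 3, 3)
names(class1) <- LETTERS[1:12]
names(class2) <- LETTERS[1:12]

# Co-membership matrices: entry [i, j] is TRUE if samples i and j share a group.
# These do not depend on what the groups are called.
co1 <- outer(class1, class1, "==")
co2 <- outer(class2, class2, "==")

# Per-sample agreement: share of the other samples whose together/apart
# status is identical in both assignments (the diagonal is excluded)
agree <- (rowSums(co1 == co2) - 1) / (length(class1) - 1)
print(round(agree, 2))

# Least stable sample under this score
print(names(which.min(agree)))
```

Samples whose neighbours never change score 1, and the lowest-scoring sample here should be K. Former and new group-mates of K score slightly below 1 as well, because their relationship to K changed.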

Licensed under: CC-BY-SA with attribution