similarity index in a list of character vectors

https://stackoverflow.com/questions/23188012

r
similarity

06-07-2023

Question

I have a list that looks like this one:

$`264`
[1] "CHAMP1" "MAP1S"  "PRRC1"  "TUT1"   "CDK12" 

$`265`
[1] "TUT1"   "PRRC1"  "CHAMP1" "MAP1S"

$`266`
[1] "REPS1"  "CHAMP1" "PRRC1"  "TUT1"   "MAP1S" 

$`267`
[1] "G3BP1"  "TUT1"   "PRRC1"  "CHAMP1" "MAP1S" 

$`268`
[1] "TUT1"   "CHAMP1" "PRRC1"  "MAP1S"  

$`269`
[1] "DDB1"   "CHAMP1" "TUT1"   "PRRC1"  "MAP1S"

Is there any package or function to calculate the similarity among the different list components?

Many thanks

Solution

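I'm not aware of any packages, but this implements the metric you described in your comment: the number of shared items divided by the number of distinct items in the two vectors (the Jaccard index). It assumes your list is stored in a variable named lst: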

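# Similarity between elements x and y of lst (given by index):
# size of the intersection divided by size of the union
siml <- function(x, y) {
  length(intersect(lst[[x]], lst[[y]])) / length(union(lst[[x]], lst[[y]]))
}
# All (x, y) index pairs, including x == y
z      <- expand.grid(x = seq_along(lst), y = seq_along(lst))
result <- mapply(siml, z$x, z$y)
# Reshape the flat vector into a length(lst) x length(lst) matrix
dim(result) <- c(length(lst), length(lst))
result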
#       [,1] [,2]  [,3]  [,4] [,5]  [,6]
# [1,] 1.000  0.8 0.667 0.667  0.8 0.667
# [2,] 0.800  1.0 0.800 0.800  1.0 0.800
# [3,] 0.667  0.8 1.000 0.667  0.8 0.667
# [4,] 0.667  0.8 0.667 1.000  0.8 0.667
# [5,] 0.800  1.0 0.800 0.800  1.0 0.800
# [6,] 0.667  0.8 0.667 0.667  0.8 1.000
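
To make one entry of that matrix concrete, here is a minimal sketch that rebuilds the list from the question and works through elements 1 and 3 by hand. The helper variables shared, all_genes and sim_1_3 are only illustrative names, not part of the original code.

```r
# Rebuild the list shown in the question; `lst` is the name the answer's code expects
lst <- list(
  `264` = c("CHAMP1", "MAP1S", "PRRC1", "TUT1", "CDK12"),
  `265` = c("TUT1", "PRRC1", "CHAMP1", "MAP1S"),
  `266` = c("REPS1", "CHAMP1", "PRRC1", "TUT1", "MAP1S"),
  `267` = c("G3BP1", "TUT1", "PRRC1", "CHAMP1", "MAP1S"),
  `268` = c("TUT1", "CHAMP1", "PRRC1", "MAP1S"),
  `269` = c("DDB1", "CHAMP1", "TUT1", "PRRC1", "MAP1S")
)

# Elements 1 and 3 (named "264" and "266")
shared    <- intersect(lst[[1]], lst[[3]])  # the 4 genes present in both
all_genes <- union(lst[[1]], lst[[3]])      # the 6 distinct genes overall
sim_1_3   <- length(shared) / length(all_genes)  # 4 / 6
round(sim_1_3, 3)
# [1] 0.667
```

Elements 2 and 5 contain the same four genes in a different order, which is why their similarity is exactly 1.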

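This is a (slightly) more efficient way to do the same thing. Because it iterates over the list elements rather than over indices, the result also keeps the names of lst as row and column names: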

result <- sapply(lst, function(x)
  sapply(lst, function(y) length(intersect(x, y)) / length(union(x, y))))
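
For longer lists, the pairwise intersect/union calls could in principle be replaced by matrix arithmetic. The following is only a sketch of that idea, not something from the original answer: jaccard_matrix is a hypothetical helper name, and it assumes every element is a non-empty character vector (an empty pair would give 0/0).

```r
# Hypothetical helper: Jaccard similarity for all pairs via a 0/1 membership matrix
jaccard_matrix <- function(lst) {
  lst   <- lapply(lst, unique)                      # treat each element as a set
  items <- unique(unlist(lst, use.names = FALSE))   # every distinct item
  # One row per list element, one column per item; 1 where the item is present
  m <- do.call(rbind, lapply(lst, function(x) as.numeric(items %in% x)))
  inter <- tcrossprod(m)                            # pairwise intersection sizes
  sizes <- rowSums(m)                               # size of each set
  # union size = size of A + size of B - intersection size
  uni <- outer(sizes, sizes, "+") - inter
  inter / uni
}

# Small made-up input to show the call; with the question's data, pass lst instead
sets <- list(a = c("x", "y", "z"), b = c("y", "z"), c = c("z", "w"))
res  <- jaccard_matrix(sets)
res["a", "b"]  # 2 shared items out of 3 distinct ones
```

Whether this is actually faster than the nested sapply depends on the number of elements and distinct items, so it is worth timing on real data before switching.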

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow