Simple matching similarity matrix for continuous, non-binary data?
10-06-2021
Question
Given the matrix
structure(list(X1 = c(1L, 2L, 3L, 4L, 2L, 5L), X2 = c(2L, 3L,
4L, 5L, 3L, 6L), X3 = c(3L, 4L, 4L, 5L, 3L, 2L), X4 = c(2L, 4L,
6L, 5L, 3L, 8L), X5 = c(1L, 3L, 2L, 4L, 6L, 4L)), .Names = c("X1",
"X2", "X3", "X4", "X5"), class = "data.frame", row.names = c(NA,
-6L))
I want to create a 5 x 5 distance matrix between all columns, where each entry is the ratio of matching entries to the total number of rows. For instance, the entry for X4 and X3 should be 0.5, since those two columns match in 3 out of 6 rows.
I have tried dist(test, method = "simple matching") from the package proxy, but this method only works for binary data.
The solution
Using outer (again :-)
my.dist <- function(x) {
  n <- nrow(x)
  # for every pair of columns (i, j), the proportion of rows where they agree
  d <- outer(seq.int(ncol(x)), seq.int(ncol(x)),
             Vectorize(function(i, j) sum(x[[i]] == x[[j]]) / n))
  rownames(d) <- names(x)
  colnames(d) <- names(x)
  return(d)
}
my.dist(test)  # test is the question's data frame
# X1 X2 X3 X4 X5
# X1 1.0000000 0.0000000 0.0 0.0 0.3333333
# X2 0.0000000 1.0000000 0.5 0.5 0.1666667
# X3 0.0000000 0.5000000 1.0 0.5 0.0000000
# X4 0.0000000 0.5000000 0.5 1.0 0.0000000
# X5 0.3333333 0.1666667 0.0 0.0 1.0000000
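For readers who find the outer/Vectorize pair opaque, the same matrix can be built with two nested sapply calls over the columns. A minimal sketch, with the question's data reassembled by hand (the names test and sim are mine):

```r
# Question's data frame, rebuilt from the structure() dump above
test <- data.frame(X1 = c(1L, 2L, 3L, 4L, 2L, 5L),
                   X2 = c(2L, 3L, 4L, 5L, 3L, 6L),
                   X3 = c(3L, 4L, 4L, 5L, 3L, 2L),
                   X4 = c(2L, 4L, 6L, 5L, 3L, 8L),
                   X5 = c(1L, 3L, 2L, 4L, 6L, 4L))

# For every pair of columns, the proportion of rows where they agree
sim <- sapply(test, function(a) sapply(test, function(b) mean(a == b)))
sim["X3", "X4"]  # 0.5: X3 and X4 match in 3 of 6 rows
```

This is equivalent to my.dist above; mean(a == b) folds the sum and the division by nrow into one call, at a small speed cost noted in the benchmark further down.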
Other answers
Here's a shot at it (dt is your matrix):
library(reshape)
df <- expand.grid(names(dt), names(dt))
df$val <- apply(df, 1, function(x) mean(dt[x[1]] == dt[x[2]]))
cast(df, Var2 ~ Var1)
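If you'd rather avoid the reshape dependency, the same expand.grid idea can be pivoted back into a matrix with base R's xtabs. A sketch under the same assumptions (test stands in for dt above; the names are mine):

```r
# Question's data frame (stand-in for dt in the answer above)
test <- data.frame(X1 = c(1L, 2L, 3L, 4L, 2L, 5L),
                   X2 = c(2L, 3L, 4L, 5L, 3L, 6L),
                   X3 = c(3L, 4L, 4L, 5L, 3L, 2L),
                   X4 = c(2L, 4L, 6L, 5L, 3L, 8L),
                   X5 = c(1L, 3L, 2L, 4L, 6L, 4L))

# One row per ordered pair of column names, plus its match ratio
df <- expand.grid(Var1 = names(test), Var2 = names(test))
df$val <- apply(df, 1, function(x) mean(test[[x[1]]] == test[[x[2]]]))

# Pivot the long format back into the 5 x 5 matrix of match ratios
xtabs(val ~ Var2 + Var1, data = df)
```

Each (Var1, Var2) cell holds exactly one val, so the sum xtabs computes is just that value.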
Here's a solution that is faster than the other two, though a bit ugly. I assume the speed-up comes from avoiding mean(), which can be slow compared to sum(), and from computing only half of the output matrix and then filling the lower triangle manually. The function currently leaves NA on the diagonal, but you can set those to one with diag(out) <- 1 to match the other answers exactly.
FUN <- function(m) {
  # compute all the combinations of column pairs
  combos <- t(combn(ncol(m), 2))
  # compute the similarity index based on the criteria defined
  sim <- apply(combos, 1, function(x) sum(m[, x[1]] - m[, x[2]] == 0) / nrow(m))
  combos <- cbind(combos, sim)
  # dimensions of output matrix
  out <- matrix(NA, ncol = ncol(m), nrow = ncol(m))
  for (i in 1:nrow(combos)) {
    # upper tri
    out[combos[i, 1], combos[i, 2]] <- combos[i, 3]
    # lower tri
    out[combos[i, 2], combos[i, 1]] <- combos[i, 3]
  }
  return(out)
}
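The explicit for loop over pairs can also be replaced by a single matrix-indexing assignment, which fills both triangles at once. A self-contained sketch on the question's data (all names here are mine, not part of the answer above):

```r
# Question's data frame, as a matrix
test <- data.frame(X1 = c(1L, 2L, 3L, 4L, 2L, 5L),
                   X2 = c(2L, 3L, 4L, 5L, 3L, 6L),
                   X3 = c(3L, 4L, 4L, 5L, 3L, 2L),
                   X4 = c(2L, 4L, 6L, 5L, 3L, 8L),
                   X5 = c(1L, 3L, 2L, 4L, 6L, 4L))
m <- as.matrix(test)

combos <- t(combn(ncol(m), 2))                    # the 10 unordered column pairs
sim <- apply(combos, 1, function(x) sum(m[, x[1]] == m[, x[2]]) / nrow(m))
out <- diag(ncol(m))                              # start with 1s on the diagonal
out[rbind(combos, combos[, 2:1])] <- rep(sim, 2)  # fill both triangles at once
dimnames(out) <- list(colnames(m), colnames(m))
out["X2", "X4"]  # 0.5
```

Indexing a matrix with a two-column matrix of (row, col) positions assigns all 20 off-diagonal cells in one vectorized step, so no loop is needed.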
I took the other two answers, made them into functions, and did some benchmarking:
library(rbenchmark)
benchmark(chase(m), flodel(m), blindJessie(m),
replications = 1000,
order = "elapsed",
columns = c("test", "elapsed", "relative"))
#-----
test elapsed relative
1 chase(m) 1.217 1.000000
2 flodel(m) 1.306 1.073131
3 blindJessie(m) 17.691 14.548520
I got the answer as follows. First, I made some modifications to the raw data:
X1 = c(1L, 2L, 3L, 4L, 2L, 5L)
X2 = c(2L, 3L, 4L, 5L, 3L, 6L)
X3 = c(3L, 4L, 4L, 5L, 3L, 2L)
X4 = c(2L, 4L, 6L, 5L, 3L, 8L)
X5 = c(1L, 3L, 2L, 4L, 6L, 4L)
matrix_cor <- rbind(X1, X2, X3, X4, X5)
matrix_cor
[,1] [,2] [,3] [,4] [,5] [,6]
X1 1 2 3 4 2 5
X2 2 3 4 5 3 6
X3 3 4 4 5 3 2
X4 2 4 6 5 3 8
X5 1 3 2 4 6 4
then (note that dist() with its defaults computes the Euclidean distance between the rows, not the matching ratio the question asks for):
dist(matrix_cor)
X1 X2 X3 X4
X2 2.449490
X3 4.472136 4.242641
X4 5.000000 3.000000 6.403124
X5 4.358899 4.358899 4.795832 6.633250
Thank you all for your suggestions. Based on your answers, I put together a three-line solution ("test" is the name of the dataset).
require(proxy)
ff <- function(x,y) sum(x == y) / NROW(x)
dist(t(test), ff, upper=TRUE)
Here is the output:
          X1        X2        X3        X4        X5
X1           0.0000000 0.0000000 0.0000000 0.3333333
X2 0.0000000           0.5000000 0.5000000 0.1666667
X3 0.0000000 0.5000000           0.5000000 0.0000000
X4 0.0000000 0.5000000 0.5000000           0.0000000
X5 0.3333333 0.1666667 0.0000000 0.0000000
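Since proxy may not be installed everywhere, ff itself is easy to sanity-check in base R. Note that ff returns match ratios (similarities), so 1 minus that value is the corresponding simple-matching distance. A quick check (the name test is as in the solution above; the data is rebuilt by hand):

```r
# Question's data frame
test <- data.frame(X1 = c(1L, 2L, 3L, 4L, 2L, 5L),
                   X2 = c(2L, 3L, 4L, 5L, 3L, 6L),
                   X3 = c(3L, 4L, 4L, 5L, 3L, 2L),
                   X4 = c(2L, 4L, 6L, 5L, 3L, 8L),
                   X5 = c(1L, 3L, 2L, 4L, 6L, 4L))

# The pairwise function from the three-line solution
ff <- function(x, y) sum(x == y) / NROW(x)

ff(test$X1, test$X5)      # 0.3333...: match ratio, as in the table above
1 - ff(test$X1, test$X5)  # 0.6666...: the corresponding simple-matching distance
```

Whether you want the ratio or one minus it depends on whether downstream code expects a similarity or a distance; proxy treats the custom function's output as-is.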