Question

I am very new to R and often refer to Stack Overflow. I would like to compare each row of a matrix with every other row to calculate a modified similarity matrix.

mat <- matrix("", 10, 12)
mat[c(1, 4, 6),] <- sample(c("AA", "AB", "BB"), 18, TRUE)
mat[c(2, 3, 10),] <- sample(c("AA", "BB", "AB"), 18, TRUE)
mat[c(5, 8),] <- sample(c("BB", "AB", "BB"), 12, TRUE)
mat[c(7, 9),] <- sample(c("AA", "AA", "BB"), 12, TRUE)
mat[3,4] = 'NA'
mat[2,5] = 'NA'

This produces:

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
 [1,] "AA" "AA" "AB" "AA" "AA" "AA" "AA" "AA" "AB" "AA"  "AA"  "AA" 
 [2,] "AB" "AA" "BB" "BB" "NA" "AB" "AB" "AA" "BB" "BB"  "BB"  "AB" 
 [3,] "BB" "AA" "AB" "NA" "AA" "AA" "BB" "AA" "AB" "AA"  "AA"  "AA" 
 [4,] "AA" "AA" "BB" "AB" "AA" "AB" "AA" "AA" "BB" "AB"  "AA"  "AB" 
 [5,] "AB" "AB" "BB" "BB" "AB" "AB" "AB" "AB" "BB" "BB"  "AB"  "AB" 
 [6,] "AA" "AA" "AB" "AA" "AB" "AA" "AA" "AA" "AB" "AA"  "AB"  "AA" 
 [7,] "BB" "AA" "AA" "BB" "AA" "AA" "BB" "AA" "AA" "BB"  "AA"  "AA" 
 [8,] "AB" "BB" "BB" "BB" "AB" "BB" "AB" "BB" "BB" "BB"  "AB"  "BB" 
 [9,] "AA" "AA" "BB" "BB" "AA" "AA" "AA" "AA" "BB" "BB"  "AA"  "AA" 
[10,] "BB" "AB" "AA" "BB" "BB" "BB" "BB" "AB" "AA" "BB"  "BB"  "BB" 

The comparison between each pair of rows should work as follows.

Step 1: Assign values by comparing two rows

AA vs AA = 1;
AA vs AB = 0.5;
AA vs NA = 0.0;
NA vs NA = 0.0;
AB vs AA = 0.5;
AA vs BB = 0.0;
AB vs AB = 0.5;

Step 2: Total the scores (example row 1 versus row 2 = 7.0)

Step 3: Count the number of positions, excluding those where one or both values are 'NA' (example row 1 versus row 2 = 11.0)

Step 4: Divide the total score by the count (example row 1 versus row 2: 7/11 = 0.636363)

Step 5: Do this for every pair of rows and return the result as a matrix with both the upper and lower triangles filled in (example: 10 x 10)

Thanks in Advance !

Solution

I will change your matrix definition a bit to make "NA" characters into actual missing values (NA) which have a special meaning in R that is close to the behavior you want.

mat <- matrix("", 10, 12)
mat[c(1, 4, 6),] <- sample(c("AA", "AB", "BB"), 18, TRUE)
mat[c(2, 3, 10),] <- sample(c("AA", "BB", "AB"), 18, TRUE)
mat[c(5, 8),] <- sample(c("BB", "AB", "BB"), 12, TRUE)
mat[c(7, 9),] <- sample(c("AA", "AA", "BB"), 12, TRUE)
mat[3,4] <- NA
mat[2,5] <- NA
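
To illustrate why this change matters, here is a minimal sketch of the difference between the string "NA" and a true missing value. The vectors x and y are made up for illustration only.

```r
# The string "NA" is ordinary text; NA (no quotes) is R's missing-value marker.
x <- c("AA", "NA", "AB")   # "NA" stored as text
y <- c("AA", NA, "AB")     # a true missing value

is.na(x)   # no element is treated as missing
is.na(y)   # only the second element is missing

# complete.cases() keeps only positions where neither vector is missing
ok <- complete.cases(cbind(y, c("AB", "AA", NA)))
sum(ok)    # number of positions usable for comparison
```

With real NA values, functions such as is.na() and complete.cases() can find the positions to skip, which is not possible when "NA" is stored as text.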

You also have not provided the values for all possible matches, so I am going to make some assumptions. These values can be changed without breaking the code.

For step 1, I am going to make a named vector that can be indexed using the two values pasted together. So AA vs AB becomes 'AAAB'.

pair <- c('AAAA', 'AAAB', 'AABB', 'ABAB', 'ABBB', 'BBBB')
value <- c(1, 0.5, 0, 0.5, 0.5, 1)
# add reverse pairing (I am assuming symmetry)
pair <- c(pair, paste0(substr(pair, 3, 4), substr(pair, 1, 2)))
value <- c(value, value)
names(value) <- pair

Check how the vector value looks at this point to make sure it's what you want. Next we define a function that uses this globally defined vector and returns what you want at the end of step 4. You could move the vector definition into the function body, but then it would be rebuilt on every call, which is wasteful.
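
As a quick check, the lookup table can be rebuilt and queried by name. This is only a sketch using the assumed values from above; the keys queried here are chosen for illustration.

```r
# Rebuild the assumed lookup table (same values as in the answer)
pair  <- c("AAAA", "AAAB", "AABB", "ABAB", "ABBB", "BBBB")
value <- c(1, 0.5, 0, 0.5, 0.5, 1)
pair  <- c(pair, paste0(substr(pair, 3, 4), substr(pair, 1, 2)))
value <- c(value, value)
names(value) <- pair

# Look up several pairs at once; reversed pairs give the same value
value[c("AAAB", "ABAA", "BBAA")]

# A key that is not in the table (for example one built from an NA) yields NA
unname(value["AANA"])
```

Note that paste0() turns a missing value into the text "NA", so any pair involving a missing value produces a key that is absent from the table and therefore looks up as NA.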

compare <- function(row1, row2){
  # count positions where neither row has an NA
  good.cases <- sum(complete.cases(cbind(row1, row2)))
  # positions where at least one row has an NA
  na.cases <- length(row1) - good.cases
  # look up each pair's value (NA pairs are dropped), then credit 0.5 per NA position
  total.value <- sum(value[paste0(row1, row2)], na.rm=TRUE) + 0.5*na.cases
  total.value/good.cases
}

At this point I get total.value of 6.5 from comparing the first 2 rows, but that is probably due to a wrong assumption in value.
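
For comparison, here is a sketch of a variant that follows the steps in the question literally: positions with an NA score nothing and are left out of the count. The name compare_strict and the values for AB vs BB and BB vs BB are assumptions, not something given in the question.

```r
# Assumed scoring table; AB/BB and BB/BB values are guesses
value <- c(AAAA = 1, AAAB = 0.5, ABAA = 0.5, AABB = 0, BBAA = 0,
           ABAB = 0.5, ABBB = 0.5, BBAB = 0.5, BBBB = 1)

compare_strict <- function(row1, row2) {
  ok <- !is.na(row1) & !is.na(row2)             # positions with no NA in either row
  scores <- value[paste0(row1[ok], row2[ok])]   # step 1: score each usable pair
  sum(scores) / sum(ok)                         # steps 2 to 4: total divided by count
}

r1 <- c("AA", "AB", "BB", NA)
r2 <- c("AA", "AA", "BB", "AB")
compare_strict(r1, r2)   # (1 + 0.5 + 1) / 3
```

If two rows share no usable positions, sum(ok) is 0 and the result is NaN, so that case may need separate handling.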

For step 5, we use a double loop:

# start empty matrix with match values
n <- nrow(mat)
matches <- matrix(rep(NA, n*n), nrow=n)
for (i in 1:n){
  for (j in i:n){  ## if symmetric, only half matrix is enough
    matches[i, j] <- compare(mat[i, ], mat[j, ])
  }
}
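
The loop above fills only the diagonal and the upper triangle. Since the question asks for both halves, one possible way to mirror the values is sketched below on a small made-up matrix m; the same two lines should work on matches, assuming the scores are symmetric.

```r
# Toy 3 x 3 matrix with only the diagonal and upper triangle filled
m <- matrix(NA_real_, 3, 3)
m[upper.tri(m, diag = TRUE)] <- c(1, 0.5, 1, 0.2, 0.7, 1)

# Copy the upper triangle into the lower triangle
m[lower.tri(m)] <- t(m)[lower.tri(m)]

isSymmetric(m)   # check that both halves now agree
```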

I hope that helps.

Edit: Changed compare() to assign a value to NA cases after request in the comments.

License: CC-BY-SA with attribution