Question

I have no idea how to tackle this problem, the only thing I can think of is a brute force loop, but I'm not even sure how to loop through the rows of a data.table in a sensible way.

I have a double keyed data.table and a correlation matrix based on the first of those keys. I need to build the full correlation matrix for all elements, by looking up the correlation for any given pair, which is zero if the second key doesn't match.

Simplified Example:

library(data.table)
DT = data.table(Key1 = c("A", "A", "A", "B", "B", "C", "C"), Key2 = c(1,2,3,2,3,3,4), OtherData = "Irrelevant")
setkey(DT, Key2, Key1)
M = matrix(c(1.0, 0.4, 0.3, 
             0.4, 1.0, 0.2, 
             0.3, 0.2, 1.0), nrow = 3)

So our starting data.table looks like:

> DT
   Key1 Key2  OtherData
1:    A    1 Irrelevant
2:    A    2 Irrelevant
3:    B    2 Irrelevant
4:    A    3 Irrelevant
5:    B    3 Irrelevant
6:    C    3 Irrelevant
7:    C    4 Irrelevant

And the pre-defined correlation matrix for the A, B & C when they share the same Key2 value, is given by M:

> M
     [,1] [,2] [,3]
[1,]  1.0  0.4  0.3
[2,]  0.4  1.0  0.2
[3,]  0.3  0.2  1.0

And I now need to make a 7x7 matrix that would look like:

> result
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]  1.0    0    0    0    0    0    0
[2,]    0  1.0  0.4    0    0    0    0
[3,]    0  0.4  1.0    0    0    0    0
[4,]    0    0    0  1.0  0.4  0.3    0
[5,]    0    0    0  0.4  1.0  0.2    0
[6,]    0    0    0  0.3  0.2  1.0    0
[7,]    0    0    0    0    0    0  1.0

Where we have created the block diagonal matrix using the parts of M that match the Key1 values available at each Key2 (Key2 is effectively time).

Was it helpful?

Solution

Here's one way (not sure how it scales though):

rownames(M) <- colnames(M) <- LETTERS[1:3]
ans <- DT[, list(idx1=.I, idx2=rep(.I, each=.N), 
            val=as.vector(M[Key1, Key1])), by=Key2]
dcast.data.table(ans, idx2 ~ idx1, value.var="val", fill=0L)

#    idx2 1   2   3   4   5   6 7
# 1:    1 1 0.0 0.0 0.0 0.0 0.0 0
# 2:    2 0 1.0 0.4 0.0 0.0 0.0 0
# 3:    3 0 0.4 1.0 0.0 0.0 0.0 0
# 4:    4 0 0.0 0.0 1.0 0.4 0.3 0
# 5:    5 0 0.0 0.0 0.4 1.0 0.2 0
# 6:    6 0 0.0 0.0 0.3 0.2 1.0 0
# 7:    7 0 0.0 0.0 0.0 0.0 0.0 1

dcast.data.table is available from data.table versions >= 1.9.0. The current stable CRAN version at the time of writing is 1.9.2.

OTHER TIPS

This does what you want:

1.Set up the data.

DT <- data.frame(Key1 = c("A", "A", "B", "A", "B", "C", "C"), Key2 = c(1, 2, 2, 3, 3, 3, 4))

M <- matrix(c(1, 0.4, 0.3, 0.4, 1, 0.2, 0.3, 0.2, 1), nrow = 3)

2.Subset the matrix, grouping by Key2 (this returns a list).

BD <- by(DT, DT$Key2, function(df) {N = as.numeric(df$Key1); M[N,N]})

3.Construct a block-diagonal matrix.

library(magic)

do.call(adiag, BD)

My data.table skills aren't that strong, but I came up with a solution that takes advantage of the indices, but only if I added the row numbers.

# DT$row<-1:nrow(DT) # No longer necessary.
# Add dimension names to matrix for convenience
rownames(M)<-colnames(M)<-c('A','B','C') 

f<-function(k1,k2) {
  # rows<-DT[.(k2)]$row 
  rows<-DT[.(k2),.I]$.I
  ret<-rep(0,nrow(DT))
  ret[rows]<-M[DT[.(k2)]$Key1,k1]
  ret
}

mapply(f,DT$Key1,DT$Key2)
#      A   A   B   A   B   C C
# [1,] 1 0.0 0.0 0.0 0.0 0.0 0
# [2,] 0 1.0 0.4 0.0 0.0 0.0 0
# [3,] 0 0.4 1.0 0.0 0.0 0.0 0
# [4,] 0 0.0 0.0 1.0 0.4 0.3 0
# [5,] 0 0.0 0.0 0.4 1.0 0.2 0
# [6,] 0 0.0 0.0 0.3 0.2 1.0 0
# [7,] 0 0.0 0.0 0.0 0.0 0.0 1

This should be a little better in the sense that the indices will be called. More efficient solutions might take advantage of the known diagonal nature of the output matrix. I wonder if there is a way to do this without adding the row numbers? The comment below indicated one way of getting the row number, I have implemented it above.

This is edited to use native data.table() features - hopefully it should perform better!

# make the cor matrix into an expand.grid equivalent - all combos - using CJ for cross join
cor_list<-data.table(CJ(LETTERS[1:nrow(M)],LETTERS[1:nrow(M)]))
# fill with the values for M
cor_list[,cor:=unlist(as.list(M))]
# index on combination of correlation inputs
setkey(cor_list, V1, V2)

# lookup correlation for all combos of DT v DT
DTX<-DT[,cor_list[J(Key1,DT[,Key1],DT[,Key2])],by=c("Key1","Key2")]
# index on Key2
setkey(DTX,Key2)
# Set cor=0 where Key2 doesn't match (OK, it's a bit of a hack!)
DTX[Key2!=V3,cor:=0]

# fill a matrix with the vector of correlations (it fits)
# original length of DT gives you the length of side 
matrix(DTX[,cor],nrow(DT))

     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1  0.0  0.0  0.0  0.0  0.0    0
[2,]    0  1.0  0.4  0.0  0.0  0.0    0
[3,]    0  0.4  1.0  0.0  0.0  0.0    0
[4,]    0  0.0  0.0  1.0  0.4  0.3    0
[5,]    0  0.0  0.0  0.4  1.0  0.2    0
[6,]    0  0.0  0.0  0.3  0.2  1.0    0
[7,]    0  0.0  0.0  0.0  0.0  0.0    1

EDITED ABOVE - DOUBLE APPLY SLOW AS PER ROLAND'S COMMENT

How about this?

#function to return letter corresponding to number
lookup_letter<-function(let){match(let,matrix(c("A","B","C")))}

then nest 2 apply calls for each dimension of the matrix

apply(DT,1,function(x){                # call row-wise
  apply(DT,1,function(y)ifelse(y[2]==x[2],M[lookup_letter(x[1]),lookup_letter(y[1])],0))   # call column-wise lookup
  })

     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1  0.0  0.0  0.0  0.0  0.0    0
[2,]    0  1.0  0.4  0.0  0.0  0.0    0
[3,]    0  0.4  1.0  0.0  0.0  0.0    0
[4,]    0  0.0  0.0  1.0  0.4  0.3    0
[5,]    0  0.0  0.0  0.4  1.0  0.2    0
[6,]    0  0.0  0.0  0.3  0.2  1.0    0
[7,]    0  0.0  0.0  0.0  0.0  0.0    1

Probably there are better ways to lookup your correlation number, but this gives you an idea (maybe flatten M into an indexed list)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top