Weighted Pearson's Correlation?

https://stackoverflow.com/questions/11557602

21-06-2021
|

Pregunta

I have a 2396x34 double matrix named y wherein each row (2396) represents a separate situation consisting of 34 consecutive time segments.

I also have a numeric[34] named x that represents a single situation of 34 consecutive time segments.

Currently I am calculating the correlation between each row in y and x like this:

crs[,2] <- cor(t(y),x)

What I need now is to replace the cor function in the above statement with a weighted correlation. The weight vector xy.wt is 34 elements long so that a different weight can be assigned to each of the 34 consecutive time segments.

I found the Weighted Covariance Matrix function cov.wt and thought that if I first scale the data it should work just like the cor function. In fact you can specify for the function to return a correlation matrix as well. Unfortunately it does not seem like I can use it in the same manner because I cannot supply my two variables (x and y) separately.

Does anyone know of a way I can get a weighted correlation in the manner I described without sacrificing much speed?

Edit: Perhaps some mathematical function could be applied to y prior to the cor function in order to get the same results that I'm looking for. Maybe if I multiply each element by xy.wt/sum(xy.wt)?

Edit #2 I found another function corr in the boot package.

corr(d, w = rep(1, nrow(d))/nrow(d))

d   
A matrix with two columns corresponding to the two variables whose correlation we wish to calculate.

w   
A vector of weights to be applied to each pair of observations. The default is equal weights for each pair. Normalization takes place within the function so sum(w) need not equal 1.

This also is not what I need but it is closer.

Edit #3 Here is some code to generate the type of data I am working with:

x<-cumsum(rnorm(34))
y<- t(sapply(1:2396,function(u) cumsum(rnorm(34))))
xy.wt<-1/(34:1)

crs<-cor(t(y),x) #this works but I want to use xy.wt as weight

Solución

You can go back to the definition of the correlation.

f <- function( x, y, w = rep(1,length(x))) {
  stopifnot( length(x) == dim(y)[2] )
  w <- w / sum(w)
  # Center x and y, using the weighted means
  x <- x - sum(x*w)
  y <- y - apply( t(y) * w, 2, sum )
  # Compute the variance
  vx <- sum( w * x * x )
  vy <- rowSums( w * y * y ) # Incorrect: see Heather's remark, in the other answer
  # Compute the covariance
  vxy <- colSums( t(y) * x * w )
  # Compute the correlation
  vxy / sqrt(vx * vy)
}
f(x,y)[1]
cor(x,y[1,]) # Identical
f(x, y, xy.wt)

Otros consejos

Unfortunately the accepted answer is wrong when y is a matrix of more than one row. The error is in the line

vy <- rowSums( w * y * y )

We want to multiply the columns of y by w, but this will multiply the rows by the elements of w, recycled as necessary. Thus

> f(x, y[1, , drop = FALSE], xy.wt)
[1] 0.103021

is correct, because in this case the multiplication is performed element-wise, which is equivalent to column-wise multiplication here, but

> f(x, y, xy.wt)[1]
[1] 0.05463575

gives a wrong answer due to the row-wise multiplication.

We can correct the function as follows

f2 <- function( x, y, w = rep(1,length(x))) {
  stopifnot(length(x) == dim(y)[2] )
  w <- w / sum(w)
  # Center x and y, using the weighted means
  x <- x - sum(x * w)
  ty <- t(y - colSums(t(y) * w))
  # Compute the variance
  vx <- sum(w * x * x)
  vy <- colSums(w * ty * ty)
  # Compute the covariance
  vxy <- colSums(ty * x * w)
  # Compute the correlation
  vxy / sqrt(vx * vy)
}

and check the results against those produced by corr from the boot package:

> res1 <- f2(x, y, xy.wt)
> res2 <- sapply(1:nrow(y), 
+                function(i, x, y, w) corr(cbind(x, y[i,]), w = w),
+                x = x, y = y, w = xy.wt)
> all.equal(res1, res2)
[1] TRUE

which in itself gives another way that this problem could be solved.

Here is a generalization to compute the weighted Pearson correlation between two matrices (instead of a vector and a matrix, as in the original question):

matrix.corr <- function (a, b, w = rep(1, nrow(a))/nrow(a)) 
{
    # normalize weights
    w <- w / sum(w)

    # center matrices
    a <- sweep(a, 2, colSums(a * w))
    b <- sweep(b, 2, colSums(b * w))

    # compute weighted correlation
    t(w*a) %*% b / sqrt( colSums(w * a**2) %*% t(colSums(w * b**2)) )
}

Using the above example and the correlation function from Heather, we can verify it:

> sum(matrix.corr(as.matrix(x, nrow=34),t(y),xy.wt) - f2(x,y,xy.wt))
[1] 1.537507e-15

In terms of calling syntax, this resembles the unweighted cor:

> a <- matrix( c(1,2,3,1,3,2), nrow=3)
> b <- matrix( c(2,3,1,1,7,3,5,2,8,1,10,12), nrow=3)
> matrix.corr(a,b)
     [,1]      [,2] [,3]      [,4]
[1,] -0.5 0.3273268  0.5 0.9386522
[2,]  0.5 0.9819805 -0.5 0.7679882
> cor(a, b)
     [,1]      [,2] [,3]      [,4]
[1,] -0.5 0.3273268  0.5 0.9386522
[2,]  0.5 0.9819805 -0.5 0.7679882

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow