“Correlation & Significance if more than 30 pairs” using R and ddply

https://stackoverflow.com//questions/9720974

16-12-2019

Question

I found part of the solution to my problem here: How to calculate correlation In R

library(plyr)  # provides ddply
set.seed(123)
X <- data.frame(ID = rep(1:2, each=5), a = sample(1:10), b = sample(1:10))
ddply(X, .(ID), summarize, cor_a_b = cor(a,b))

In addition to cor (which calculates Pearson's r), I run cor.test to get the p-value. But cor.test fails with "not enough finite observations" when an ID has only a single row, which happens quite often in my data.

So I need to calculate r only if there are more than 30 or so pairs of data; if there are fewer, I want NA.

The second problem is that the verbose output of cor.test inflates the resulting data frame, even though the only thing I want is the p-value. That is, if p actually is what I understand it to be: is it the significance of r?

I only know the t-test as a way to calculate the significance of r.

{Formula for the t statistic: t = r·(n-2)^0.5 / (1-r^2)^0.5. But t is not the significance yet; otherwise I would try to implement the formula in the ddply statement.}
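For reference, the t statistic above can be turned into a two-sided p-value with the t distribution on n - 2 degrees of freedom, which is what cor.test reports for Pearson's r. A minimal sketch, assuming simulated data (the vectors a and b and the sample size of 40 are made up for illustration):

```r
# Simulated data, purely for illustration
set.seed(123)
a <- rnorm(40)
b <- rnorm(40)

n <- length(a)
r <- cor(a, b)                                # Pearson's r

# t statistic from the formula in the question
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_manual <- 2 * pt(-abs(t_stat), df = n - 2)

# p-value reported by cor.test for comparison
p_cortest <- cor.test(a, b)$p.value

# Compare the two values (up to floating-point tolerance)
print(all.equal(p_manual, p_cortest))
```

So the p-value is the significance of r in the t-test sense, and extracting cor.test(a, b)$p.value avoids carrying the full test output around.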

Solution

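Try this. cor.test needs at least 3 pairs, so smaller groups get NA; raise the 3 to 30 or so for the stricter cutoff. Note that the cor_a_b column here holds the p-value, not r:

> library(plyr)
> d <- data.frame(id = rep(1:3, c(5, 1, 10)), a = rnorm(16), b = rnorm(16))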
> ddply(d, .(id), summarize, cor_a_b = if(length(id) < 3) {NA} else {cor.test(a, b)$p.value})
  id   cor_a_b
1  1 0.4393595
2  2        NA
3  3 0.5602855
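If both r and the p-value are wanted, with the cutoff of roughly 30 pairs from the question, the same idea can be written with a small helper passed to ddply. This is only a sketch: the helper name cor_summary, its min_pairs argument, and the simulated group sizes are assumptions for illustration, not part of the original answer.

```r
library(plyr)

# Simulated data: groups of 40, 1 and 35 rows (illustrative only)
set.seed(123)
d <- data.frame(id = rep(1:3, c(40, 1, 35)), a = rnorm(76), b = rnorm(76))

# Hypothetical helper: returns n, r and p for one group,
# or NA for r and p when there are fewer than min_pairs complete pairs
cor_summary <- function(df, min_pairs = 30) {
  ok <- complete.cases(df$a, df$b)   # pairs with no missing values
  n <- sum(ok)
  # cor.test itself needs at least 3 pairs, so never go below that
  if (n < max(min_pairs, 3)) {
    return(data.frame(n = n, r = NA_real_, p = NA_real_))
  }
  ct <- cor.test(df$a[ok], df$b[ok])
  data.frame(n = n, r = unname(ct$estimate), p = ct$p.value)
}

# ddply forwards extra arguments (here min_pairs) to the helper
res <- ddply(d, .(id), cor_summary, min_pairs = 30)
print(res)
```

The result has one row per id with columns id, n, r and p; the single-row group gets NA in both r and p instead of raising an error.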

Licensed under: CC-BY-SA with attribution