How to find significant correlations in a large dataset

Question 1

Here's some sample data for reproducibility.

m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.

correlations <- vapply(
  the_data[, -1],
  function(x)
  {
    cor(the_data[, 1], x)
  },
  numeric(1)
)

You can then find the column with the largest magnitude of correlation with y using:

correlations[which.max(abs(correlations))]

So knowing which variables are correlated which which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand, and which techniques you need to use. The folks over at Cross Validated can help.

Question 2

You can use the function rcorr from the package Hmisc.

Using the same demo data from Richie:

m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

Then:

library(Hmisc)
correlations <- rcorr(as.matrix(the_data))

To access the p-values:

correlations$P

To visualize you can use the package corrgram

library(corrgram)
corrgram(the_data)

Which will produce: enter image description here

Question 3

In order to print a list of the significant correlations (p < 0.05), you can use the following.

Using the same demo data from @Richie:

m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))

Install Hmisc
```
install.packages("Hmisc")
```

Import library and find the correlations (@Carlos)

library(Hmisc)
correlations <- rcorr(as.matrix(the_data))

Loop over the values printing the significant correlations

for (i in 1:m){
  for (j in 1:m){
    if ( !is.na(correlations$P[i,j])){
      if ( correlations$P[i,j] < 0.05 ) {
        print(paste(rownames(correlations$P)[i], "-" , colnames(correlations$P)[j], ": ", correlations$P[i,j]))
      }
    }
  }
}

Warning

You should not use this for drawing any serious conclusion; only useful for some exploratory analysis and formulate hypothesis. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that do do some adjustments to compensate for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.

Question 4

If you are trying to predict y using only one variable than you have to take the one that is mainly correlated with y. To do this just use the command which.max(abs(cor(x,y))). If you want to use more than one variable in your model then you have to consider something like the lasso estimator

Question 5

One option is to run a correlation matrix:

cor_result=cor(data)
write.csv(cor_result, file="cor_result.csv")

This correlates all the variables in the file against each other and outputs a matrix.