Question

Assume we have two numeric vectors x and y. The Pearson correlation coefficient between x and y is given by

cor(x, y)

How can I automatically consider only a subset of x and y in the calculation (say 90%) so as to maximize the correlation coefficient?


Solution

If you really want to do this (i.e. remove the observations with the largest absolute residuals), then we can fit a linear model to get the least-squares solution and its residuals, and then select the middle n% of the data. Here is an example:

Firstly, generate some dummy data:

require(MASS) ## for mvrnorm()
set.seed(1)
dat <- mvrnorm(1000, mu = c(4,5), Sigma = matrix(c(1,0.8,1,0.8), ncol = 2))
dat <- data.frame(dat)
names(dat) <- c("X","Y")
plot(dat)

Next, we fit the linear model and extract the residuals:

res <- resid(mod <- lm(Y ~ X, data = dat))

The quantile() function can give us the required quantiles of the residuals. You suggested retaining 90% of the data, so we want the 0.05 and 0.95 quantiles:

res.qt <- quantile(res, probs = c(0.05,0.95))

Select those observations with residuals in the middle 90% of the data:

want <- which(res >= res.qt[1] & res <= res.qt[2])

We can then visualise this, with the red points being those we will retain:

plot(dat, type = "n")
points(dat[-want,], col = "black", pch = 21, bg = "black", cex = 0.8)
points(dat[want,], col = "red", pch = 21, bg = "red", cex = 0.8)
abline(mod, col = "blue", lwd = 2)

[Plot of the dummy data showing the selected points (those with the smallest residuals) in red, with the fitted line in blue]

The correlations for the full data and the selected subset are:

> cor(dat)
          X         Y
X 1.0000000 0.8935235
Y 0.8935235 1.0000000
> cor(dat[want,])
          X         Y
X 1.0000000 0.9272109
Y 0.9272109 1.0000000
> cor(dat[-want,])
         X        Y
X 1.000000 0.739972
Y 0.739972 1.000000

Be aware that here we might be throwing out perfectly good data, because we simply discard the 5% of observations with the largest positive residuals and the 5% with the largest negative ones. An alternative is to select the 90% with the smallest absolute residuals:

ares <- abs(res)
absres.qt <- quantile(ares, probs = 0.9)
abswant <- which(ares <= absres.qt)
## plot - virtually the same, but not quite
plot(dat, type = "n")
points(dat[-abswant,], col = "black", pch = 21, bg = "black", cex = 0.8)
points(dat[abswant,], col = "red", pch = 21, bg = "red", cex = 0.8)
abline(mod, col = "blue", lwd = 2)

With this slightly different subset, the correlation is slightly lower:

> cor(dat[abswant,])
          X         Y
X 1.0000000 0.9272032
Y 0.9272032 1.0000000

Another point is that even then we are throwing out good data. You might want to look at Cook's distance as a measure of the strength of the outliers, and discard only those observations whose Cook's distance exceeds a certain threshold. Wikipedia has information on Cook's distance and proposed thresholds. The cooks.distance() function can be used to retrieve the values from mod:

> head(cooks.distance(mod))
           1            2            3            4            5            6 
7.738789e-04 6.056810e-04 6.375505e-04 4.338566e-04 1.163721e-05 1.740565e-03

If you compute the threshold(s) suggested on Wikipedia, you can remove only those observations that exceed them. For these data, none of the Cook's distances exceed the common rule-of-thumb cutoff of 1:

> any(cooks.distance(mod) > 1)
[1] FALSE

A stricter rule of thumb flags observations with a Cook's distance greater than 4/n (here 4/1000), which you could check with any(cooks.distance(mod) > 4 / nrow(dat)). Small distances are not surprising given the way I generated the data.
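If some observations did exceed a threshold, a minimal sketch of dropping them and recomputing the correlation could look like this (the 4/n cutoff here is my assumption; use whichever threshold you prefer):

cd <- cooks.distance(mod)
keep <- cd <= 4 / nrow(dat)  ## 4/n rule of thumb; adjust as needed
cor(dat[keep, ])             ## correlation without the high-influence points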

Having said all of this, why do you want to do this? If you are just trying to get rid of data to improve a correlation or generate a significant relationship, that sounds a bit fishy, and a bit like data dredging, to me.

OTHER TIPS

Using method = "spearman" in cor will be robust to contamination and is easy to implement since it only involves replacing cor(x, y) with cor(x, y, method = "spearman").

Repeating Prasad's analysis below, but using the Spearman correlation instead, we find that it is indeed robust to the contamination here, recovering the underlying zero correlation:

set.seed(1)

# x and y are uncorrelated
x <- rnorm(1000)
y <- rnorm(1000)
cor(x,y)
## [1] 0.006401211

# add contamination -- now cor says they are highly correlated
x <- c(x, 500)
y <- c(y, 500)
cor(x, y)
## [1] 0.995741

# but with method = "spearman" contamination is removed & they are shown to be uncorrelated
cor(x, y, method = "spearman")
## [1] -0.007270813

This may have been already obvious to the OP, but just to make sure... You have to be careful, because trying to maximize correlation may actually tend to include outliers. (@Gavin touched on this point in his answer/comments.) I would remove the outliers first, then calculate the correlation. More generally, we want to calculate a correlation that is robust to outliers (and there are many such methods in R).

Just to illustrate this dramatically, let's create two vectors x and y that are uncorrelated:

set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
> cor(x,y)
[1] 0.006401211

Now let's add an outlier point (500,500):

x <- c(x, 500)
y <- c(y, 500)

Now the correlation of any subset that includes the outlier point will be close to 100%, and the correlation of any sufficiently large subset that excludes the outlier will be close to zero. In particular,

> cor(x,y)
[1] 0.995741

If you want to estimate a "true" correlation that is not sensitive to outliers, you might try the robust package:

require(robust)
> covRob(cbind(x,y), corr = TRUE)
Call:
covRob(data = cbind(x, y), corr = TRUE)

Robust Estimate of Correlation: 
            x           y
x  1.00000000 -0.02594260
y -0.02594260  1.00000000

You can play around with the parameters of covRob() to decide how to trim the data. UPDATE: There is also rlm() (robust linear regression) in the MASS package.
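As a minimal sketch of the rlm() route (applied here to dat from the accepted answer; the trimming step is my own illustration, not part of the original), you could base the subsetting on robust residuals instead:

library(MASS)
## M-estimation down-weights large-residual observations rather
## than discarding them outright
rmod <- rlm(Y ~ X, data = dat)
rres <- resid(rmod)
## keep the 90% of points with the smallest absolute robust residuals
rwant <- which(abs(rres) <= quantile(abs(rres), probs = 0.9))
cor(dat[rwant, ])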

Here's another possibility, with the outliers captured, using a similar scheme to Prasad's:

library(mvoutlier)    
set.seed(1)    
x <- rnorm(1000)    
y <- rnorm(1000)    
xy <- cbind(x, y)    
outliers <- aq.plot(xy, alpha = 0.975)  ## the documentation default is alpha = 0.025, but I think the function wants 0.975
cor.plot(x, y)    
color.plot(xy)   
dd.plot(xy)   
uni.plot(xy)    

In the other answers, 500 was stuck on the end of x and y as an outlier. That may or may not cause a memory problem on your machine, so I dropped it down to 4 to avoid that.

x1 <- c(x, 4)     
y1 <- c(y, 4)    
xy1 <- cbind(x1, y1)    
outliers1 <- aq.plot(xy1, alpha = 0.975)  ## again, the documentation default is alpha = 0.025, but I think the function wants 0.975
cor.plot(x1, y1)    
color.plot(xy1)    
dd.plot(xy1)    
uni.plot(xy1)    
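Note that aq.plot() also returns its verdict: per the mvoutlier documentation it gives back a list with a logical outliers component, so you can recompute the correlation on just the clean points. A sketch, assuming that return structure:

clean <- !outliers1$outliers  ## TRUE for rows aq.plot() did not flag
cor(x1[clean], y1[clean])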

[Images from the x1, y1, xy1 data: the output of the plotting calls above]

You might try resampling your data (bootstrap-style, though without replacement, since we want distinct 90% subsets) to find the highest correlation coefficient, e.g.:

x <- cars$dist
y <- cars$speed
percent <- 0.9         # given in the question above
n <- 1000              # number of resampling
boot.cor <- replicate(n, {
  tmp <- sample(length(x), size = round(length(x) * percent))
  cor(x[tmp], y[tmp])
})

Then run max(boot.cor) for the largest coefficient found. Do not be disappointed if the coefficients turn out to be all quite similar :)
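To see how much the subsampling actually moves the coefficient, inspect the whole distribution (hist() and quantile() are base R):

hist(boot.cor)                    ## spread of correlations across subsamples
quantile(boot.cor, c(0, 0.5, 1))  ## worst, median, and best subsample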

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow