How to detect outliers in the columns of a data frame in R?

https://stackoverflow.com/questions/16089572

04-04-2022

Question

Suppose I have the following data frame:

names<-c("a","a","a","a","a","b","b","b","b","b","c","c","c","c","c","c","c","c")
var1<-c(0.942999593,0.935507266,0.973589623,0.969415912,0.95230801,0.935507266,0.888740961,0.91750551,0.944482672,0.945468585,1.457579147,0.922206277,0.941511433,0.954724791,0.941014244,0.941511433,0.941511433,1.50511433)
var2<-c(-0.012678088,0.014313763,0.001138275,-0.020568206,0.012987126,0.001217192,0.03360358,0.009758172,0.015066932,-0.037879492,0.020471157,0.010738162,0.010952531,0.019377213,0.027140572,0.031116892,-0.018530676,-8.90E-05)
# data.frame() keeps var1 and var2 numeric; cbind() would coerce every column to character
df <- data.frame(names, var1, var2)

I would like to replace the outliers in the columns var1 and var2 with NA. However, the outliers should be determined independently for each category in the column "names". For example, the outliers for "a" in var1 are the ones found using only the first 5 rows of var1.

I define an outlier as any value below the 0.25 quantile or above the 0.75 quantile of its group.

Is there any easy way to do this in R?

Thank you very much in advance.

Tina.

Solution

Here's how you can do it for var1:

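# Quantiles of var1 per group (one vector of 0%, 25%, 50%, 75%, 100% per name)
quantiles <- tapply(var1, names, quantile)
# Look up the lower and upper bound that applies to each row's group
minq <- sapply(names, function(x) quantiles[[x]]["25%"])
maxq <- sapply(names, function(x) quantiles[[x]]["75%"])
# Values outside their group's 25%-75% range become NA
var1[var1 < minq | var1 > maxq] <- NA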

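Repeat the same steps for var2. To update the data frame rather than the standalone vectors, apply them to df$var1 and df$var2, which must be numeric columns.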
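
To avoid repeating the steps by hand for each column, the same rule can be sketched with base R's ave(), which applies a function within each group and returns a vector aligned with the original rows. This is only one possible approach, not the answer's original method. The helper name na_outside_quartiles is hypothetical, and the data frame is rebuilt with data.frame() on the assumption that var1 and var2 should be numeric.

```r
# Example data from the question, built with data.frame() so that
# var1 and var2 stay numeric
df <- data.frame(
  names = rep(c("a", "b", "c"), times = c(5, 5, 8)),
  var1 = c(0.942999593, 0.935507266, 0.973589623, 0.969415912, 0.95230801,
           0.935507266, 0.888740961, 0.91750551, 0.944482672, 0.945468585,
           1.457579147, 0.922206277, 0.941511433, 0.954724791, 0.941014244,
           0.941511433, 0.941511433, 1.50511433),
  var2 = c(-0.012678088, 0.014313763, 0.001138275, -0.020568206, 0.012987126,
           0.001217192, 0.03360358, 0.009758172, 0.015066932, -0.037879492,
           0.020471157, 0.010738162, 0.010952531, 0.019377213, 0.027140572,
           0.031116892, -0.018530676, -8.90E-05)
)

# Hypothetical helper: set values below the 25% quantile or above the
# 75% quantile of x to NA
na_outside_quartiles <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  x[x < q[1] | x > q[2]] <- NA
  x
}

# Apply the helper separately within each level of df$names,
# for both columns
for (col in c("var1", "var2")) {
  df[[col]] <- ave(df[[col]], df$names, FUN = na_outside_quartiles)
}
```

For group "a" in var1, the 25% and 75% quantiles are 0.942999593 and 0.969415912, so rows 2 and 3 (0.935507266 and 0.973589623) become NA while rows 1, 4 and 5 are kept.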

Licensed under: CC-BY-SA with attribution
