Gracefully removing observations with outliers in N fields

https://datascience.stackexchange.com/questions/45903

01-11-2019

Question

I have a function that removes outliers from a numeric vector using the 1.5 * IQR rule:

remove_outliers <- function(x, na.rm = TRUE, ...) {

    # First and third quartiles, ignoring NAs
    qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)

    # Fence width: 1.5 times the interquartile range
    H <- 1.5 * IQR(x, na.rm = na.rm)

    # Mark values outside the fences as NA
    y <- x
    y[x < (qnt[1] - H)] <- NA
    y[x > (qnt[2] + H)] <- NA

    # Drop the NAs (the marked outliers and any NAs already in x)
    y[!is.na(y)]
}

Given a dataset (numbers) like this:

The usage is straightforward:

remove_outliers(numbers)

means I now have this:

However, what if I have an ID that I want to retain, such as:

number_id    numbers
12              5
23              9
34              2
45              99
56              3
67              4

How do I remove the outlier (99) with the remove_outliers function (or another, better-suited function) to get this data:

number_id    numbers
12              5
23              9
34              2
56              3
67              4

(Note that the entire observation containing the outlier has been removed.)
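
One possible direction, offered only as a sketch: instead of returning the filtered values, have the function return a logical mask and use that mask to subset the rows of the data frame. The helper name is_outlier is made up for illustration and is not part of the original function. Unlike remove_outliers, this sketch does not flag NA values, so rows with NA are kept.

```r
# Example data copied from the table above
df <- data.frame(
  number_id = c(12, 23, 34, 45, 56, 67),
  numbers   = c(5, 9, 2, 99, 3, 4)
)

# Hypothetical helper (not from the question): TRUE where a value lies
# outside the 1.5 * IQR fences, FALSE otherwise. NAs are never flagged.
is_outlier <- function(x, na.rm = TRUE) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  out <- x < (qnt[1] - H) | x > (qnt[2] + H)
  out[is.na(out)] <- FALSE
  out
}

# Keep only the rows whose value is not flagged; number_id stays attached
cleaned <- df[!is_outlier(df$numbers), ]
print(cleaned)
```

Because the mask has one entry per row, subsetting with it drops the whole observation (here the row with number_id 45) rather than just the single value.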

And how can I scale this solution to handle n more variables?

I can do it very ungracefully by taking out each column separately and building a new data frame with loops, but it's hardly readable and a mess to debug. Is there a more graceful way?
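
A loop-free way to extend this to several columns, again only a sketch under assumptions: compute one logical outlier mask per column with sapply, then keep the rows where no column is flagged. The helper is_outlier, the vector value_cols, and the second column other (with its values) are all made up for illustration. Each column is checked against its own quartiles computed on the full data, which is a design choice; filtering column by column in sequence could give a different result.

```r
# The question's data plus a made-up second variable for illustration
df <- data.frame(
  number_id = c(12, 23, 34, 45, 56, 67),
  numbers   = c(5, 9, 2, 99, 3, 4),
  other     = c(10, 11, 250, 12, 9, 10)  # hypothetical values
)

# Hypothetical helper: TRUE where a value lies outside the
# 1.5 * IQR fences, FALSE otherwise. NAs are never flagged.
is_outlier <- function(x, na.rm = TRUE) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  out <- x < (qnt[1] - H) | x > (qnt[2] + H)
  out[is.na(out)] <- FALSE
  out
}

# Columns to check; the ID column is deliberately left out
value_cols <- c("numbers", "other")

# Logical matrix: one row per observation, one column per variable
flags <- sapply(df[value_cols], is_outlier)

# Keep a row only if none of its checked values is flagged
cleaned <- df[rowSums(flags) == 0, ]
print(cleaned)
```

Adding more variables then only means adding names to value_cols; the filtering line stays the same.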

No correct solution

Licensed under: CC-BY-SA with attribution
