Question

I have a large dataset and have defined outliers to be those values which fall either above the 99th or below the 1st percentile.

I'd like to take the mean of those outliers with their previous and following datapoints, then replace all 3 values with that average in a new dataset.

If there's anyone who knows how to do this I'd be very grateful for a response.


Solution

If you have a vector of indices giving the outliers' locations in the vector, e.g. from the following (where quan0.99 holds the 99th percentile of vec):

out_idx = which(vec > quan0.99)
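As a rough sketch of how the thresholds and indices could be obtained for both tails at once: the toy vector below is made up for illustration, and the names limits and quan0.01 are assumptions that mirror the quan0.99 name used in this answer.

```r
# Toy data standing in for the real dataset (an assumption for illustration)
set.seed(1)
vec <- rnorm(1000)

# 1st and 99th percentiles in a single quantile() call
limits <- quantile(vec, probs = c(0.01, 0.99), names = FALSE)
quan0.01 <- limits[1]
quan0.99 <- limits[2]

# Positions of the values that fall outside either threshold
out_idx <- which(vec < quan0.01 | vec > quan0.99)
length(out_idx)
```

Combining both conditions in one which() call gives a single index vector, so the replacement loop only has to run once.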

You can do something like:

for(idx in out_idx) {
  vec[(idx-1):(idx+1)] = mean(vec[(idx-1):(idx+1)])
}
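One thing to watch with the loop above is an outlier in the first or last position. For the last element, idx+1 points past the end of the vector, so the mean becomes NA and the assignment makes the vector one element longer. For the first element, index 0 is silently dropped. A minimal sketch that clamps the window to the valid range, using a made-up vector and the assumed names lo and hi:

```r
# Made-up example with outliers at both ends
vec <- c(50, 2, 3, 4, 5, 6, 90)
out_idx <- c(1, 7)

for (idx in out_idx) {
  # keep the window inside 1..length(vec)
  lo <- max(1, idx - 1)
  hi <- min(length(vec), idx + 1)
  vec[lo:hi] <- mean(vec[lo:hi])
}
vec
# 26 26 3 4 5 48 48
```

At the edges the average is taken over two values instead of three, which may or may not be what you want. Skipping edge outliers entirely is another reasonable choice.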

You can wrap this in a function, making the window half-width (bandwith) and the summary function (func) optional parameters:

average_outliers = function(vec, outlier_idx, bandwith = 1, func = "mean") {
  # iterate over the outlier positions
  for (idx in outlier_idx) {
    # clamp the window so it stays inside the vector at either end
    window = max(1, idx - bandwith):min(length(vec), idx + bandwith)
    # slicing can be used for extracting values or, as in this case, for
    # assigning values to that slice. do.call calls the function named by
    # func (e.g. "mean") with the values in the window as its argument.
    vec[window] = do.call(func, list(vec[window]))
  }
  return(vec)
}

This also lets you use, for example, the median with a bandwidth of 2. To use the function:

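# Compute the two thresholds once, on the original data.
quan0.01 = quantile(vec, 0.01)
quan0.99 = quantile(vec, 0.99)
# Apply average_outliers twice: first to the values above the 0.99 quantile,
# then to the values below the 0.01 quantile.
vec = average_outliers(vec, which(vec > quan0.99))
vec = average_outliers(vec, which(vec < quan0.01))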

or:

vec = average_outliers(vec, which(vec > quan0.99), bandwith = 2, func = "median")
vec = average_outliers(vec, which(vec < quan0.01), bandwith = 2, func = "median")

to use a bandwidth of 2 and replace with the median value.
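Since the question asks for the result in a new dataset, here is one possible variant that leaves the input untouched. It is only a sketch: smooth_outliers and its parameters (lower, upper, bandwidth, func) are hypothetical names, not part of the answer above, and it always averages the original values, so overlapping windows do not feed already-smoothed values back into later averages.

```r
# Hypothetical helper: returns a smoothed copy and leaves x unchanged
smooth_outliers <- function(x, lower = 0.01, upper = 0.99,
                            bandwidth = 1, func = mean) {
  # thresholds for both tails
  limits <- quantile(x, probs = c(lower, upper), names = FALSE, na.rm = TRUE)
  out_idx <- which(x < limits[1] | x > limits[2])

  result <- x
  for (idx in out_idx) {
    # clamp the window to the valid index range
    window <- max(1, idx - bandwidth):min(length(x), idx + bandwidth)
    # read from the original x, write to the copy
    result[window] <- func(x[window])
  }
  result
}

# Made-up data for illustration
set.seed(42)
original <- rnorm(500)
smoothed <- smooth_outliers(original)
smoothed_median <- smooth_outliers(original, bandwidth = 2, func = median)
```

Passing the function itself (func = median) rather than its name as a string avoids the need for do.call. Whether to average original or already-replaced values is a design choice, and the in-place loop shown earlier does the latter.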

Licensed under: CC-BY-SA with attribution