R Language - Sorting data into ranges; averaging; ignore outliers

https://stackoverflow.com/questions/4843194

r
outliers

27-10-2019
|

Question

I am analyzing data from a wind turbine, normally this is the sort of thing I would do in excel but the quantity of data requires something heavy-duty. I have never used R before and so I am just looking for some pointers.

The data consists of 2 columns WindSpeed and Power, so far I have arrived at importing the data from a CSV file and scatter-plotted the two against each other.

What I would like to do next is to sort the data into ranges; for example all data where WindSpeed is between x and y and then find the average of power generated for each range and graph the curve formed.

From this average I want recalculate the average based on data which falls within one of two standard deviations of the average (basically ignoring outliers).

Any pointers are appreciated.

For those who are interested I am trying to create a graph similar to this. Its a pretty standard type of graph but like I said the shear quantity of data requires something heavier than excel.

Solution

Throw this version, similar in motivation as @hadley's, into the mix using an additive model with an adaptive smoother using package mgcv:

Dummy data first, as used by @hadley

w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
df <- data.frame(power = power, w_sp = w_sp)

Fit the additive model using gam(), using an adaptive smoother and smoothness selection via REML

require(mgcv)
mod <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)

Predict from our model and get standard errors of fit, use latter to generate an approximate 95% confidence interval

x_grid <- with(df, data.frame(w_sp = seq(min(w_sp), max(w_sp), length = 100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
x_grid <- within(x_grid, upr <- fit + 2 * pred$se.fit)
x_grid <- within(x_grid, lwr <- fit - 2 * pred$se.fit)

Plot everything and the Loess fit for comparison

plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = x_grid, col = "red", lwd = 3)
## upper and lower confidence intervals ~95%
lines(upr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
## add loess fit from @hadley's answer
lines(x_grid$w_sp, predict(loess(power ~ w_sp, data = df), x_grid), col = "blue",
      lwd = 3)

adaptive smooth and loess fits

OTHER TIPS

Since you're no longer in Excel, why not use a modern statistical methodology that doesn't require crude binning of the data and ad hoc methods to remove outliers: locally smooth regression, as implemented by loess.

Using a slight modification of csgillespie's sample data:

w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)

plot(w_sp, power)

x_grid <- seq(0, 100, length = 100)
lines(x_grid, predict(loess(power ~ w_sp), x_grid), col = "red", lwd = 3)

First we will create some example data to make the problem concrete:

w_sp = sample(seq(0, 100, 0.01), 1000)
power = 1/(1+exp(-(rnorm(1000, mean=w_sp, sd=5) -40)/5))

Suppose we want to bin the power values between [0,5), [5,10), etc. Then

bin_incr = 5
bins = seq(0, 95, bin_incr)
y_mean = sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x+bin_incr)]))

We have now created the mean values between the ranges of interest. Note, if you wanted the median values, just change mean to median. All that's left to do, is to plot them:

plot(w_sp, power)
points(seq(2.5, 97.5, 5), y_mean, col=3, pch=16)

To get the average based on data that falls within two standard deviations of the average, we need to create a slightly more complicated function:

noOutliers = function(x, power, w_sp, bin_incr) {
  d = power[w_sp >= x & w_sp < (x + bin_incr)]
  m_d = mean(d)
  d_trim = mean(d[d > (m_d - 2*sd(d)) & (d < m_d + 2*sd(d))])
  return(mean(d_trim))
}

y_no_outliers = sapply(bins, noOutliers, power, w_sp, bin_incr)

Here are some examples of fitted curves (weibull analysis) for commercial turbines:

http://www.inl.gov/wind/software/

http://www.irec.cmerp.net/papers/WOE/Paper%20ID%20161.pdf

http://www.icaen.uiowa.edu/~ie_155/Lecture/Power_Curve.pdf

I'd recommend also playing around with Hadley's own ggplot2. His website is a great resource: http://had.co.nz/ggplot2/ .

    # If you haven't already installed ggplot2:
    install.pacakges("ggplot2", dependencies = T)

    # Load the ggplot2 package
    require(ggplot2)

    # csgillespie's example data
    w_sp <- sample(seq(0, 100, 0.01), 1000)
    power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)

    # Bind the two variables into a data frame, which ggplot prefers
    wind <- data.frame(w_sp = w_sp, power = power)

    # Take a look at how the first few rows look, just for fun
    head(wind)


    # Create a simple plot
    ggplot(data = wind, aes(x = w_sp, y = power)) + geom_point() + geom_smooth()

    # Create a slightly more complicated plot as an example of how to fine tune
    # plots in ggplot
    p1 <- ggplot(data = wind, aes(x = w_sp, y = power))
    p2 <- p1 + geom_point(colour = "darkblue", size = 1, shape = "dot") 
    p3 <- p2 + geom_smooth(method = "loess", se = TRUE, colour = "purple")
    p3 + scale_x_continuous(name = "mph") + 
             scale_y_continuous(name = "power") +
             opts(title = "Wind speed and power")

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow