R - Sorting data into ranges; averaging; ignoring outliers

https://stackoverflow.com/questions/4843194

r
outliers

27-10-2019

Question

I am analysing data from a wind turbine. Normally this is the sort of thing I would do in Excel, but the quantity of data requires something heavy-duty. I have never used R before, so I am just looking for some pointers.

The data consists of 2 columns, WindSpeed and Power. So far I have got as far as importing the data from a CSV file and scatter-plotting the two against each other.

What I would like to do next is to sort the data into ranges, for example all data where WindSpeed is between x and y, then find the average of the power generated for each range and graph the curve formed.

From this average I want to recalculate the average based on data that falls within one or two standard deviations of the average (basically ignoring outliers).

Any suggestions are appreciated.

For those who are interested, I am trying to create a graph similar to this. It is a pretty standard type of graph but, as I mentioned, the sheer quantity of data requires something heavier than Excel.

Solution

Here is another version to throw into the mix, similar in motivation to @hadley's: an additive model with an adaptive smoother, fitted with the mgcv package.

Dummy data first, as used by @hadley

w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
df <- data.frame(power = power, w_sp = w_sp)

Fit the additive model with gam(), using an adaptive smoother and smoothness selection via REML:

library(mgcv)
mod <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)

Predict from the model and get standard errors of the fit, then use the latter to generate an approximate 95% confidence interval:

x_grid <- with(df, data.frame(w_sp = seq(min(w_sp), max(w_sp), length = 100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
x_grid <- within(x_grid, upr <- fit + 2 * pred$se.fit)
x_grid <- within(x_grid, lwr <- fit - 2 * pred$se.fit)

Plot everything, adding the loess fit for comparison:

plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = x_grid, col = "red", lwd = 3)
## upper and lower confidence intervals ~95%
lines(upr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
## add loess fit from @hadley's answer
lines(x_grid$w_sp, predict(loess(power ~ w_sp, data = df), x_grid), col = "blue",
      lwd = 3)

[Figure: adaptive smooth and loess fits]
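The multiplier of 2 used for the interval above is a rounded normal quantile. As a rough sketch of the same construction at a few individual wind speeds, the block below refits the model on freshly simulated data and uses qnorm(0.975) instead. The seed, the `_demo` object names and the three wind speeds are assumptions made for illustration only; they are not part of the original answer.

```r
library(mgcv)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
w_sp_demo  <- sample(seq(0, 100, 0.01), 1000)
power_demo <- 1 / (1 + exp(-(w_sp_demo - 40) / 5)) + rnorm(1000, sd = 0.1)
df_demo    <- data.frame(power = power_demo, w_sp = w_sp_demo)

# Same model specification as in the answer
mod_demo <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df_demo,
                method = "REML")

# Hypothetical wind speeds at which we want the fitted power
new_speeds <- data.frame(w_sp = c(20, 40, 60))
pred_demo  <- predict(mod_demo, new_speeds, se.fit = TRUE)

crit <- qnorm(0.975)  # normal quantile for a two-sided 95% interval
ci <- data.frame(
  w_sp = new_speeds$w_sp,
  fit  = as.numeric(pred_demo$fit),
  lwr  = as.numeric(pred_demo$fit - crit * pred_demo$se.fit),
  upr  = as.numeric(pred_demo$fit + crit * pred_demo$se.fit)
)
print(ci)
```

These are pointwise intervals for the fitted curve, not prediction intervals for new observations, so most raw data points will fall outside them.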

Other suggestions

Since you're no longer in Excel, why not use a modern statistical methodology that doesn't require crude binning of the data and ad hoc methods to remove outliers? Local regression smoothing, as implemented by loess(), does exactly that.

Using a slight modification of csgillespie's sample data:

w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)

plot(w_sp, power)

x_grid <- seq(0, 100, length = 100)
lines(x_grid, predict(loess(power ~ w_sp), x_grid), col = "red", lwd = 3)
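Since the question is specifically about ignoring outliers, it may be worth knowing that loess() has a family = "symmetric" option, which refits with robustness weights so that points far from the curve count for less. The sketch below compares it with the default fit; the seed and the planted zero-power readings are assumptions added purely to make the difference visible.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible
w_sp  <- sample(seq(0, 100, 0.01), 1000)
power <- 1 / (1 + exp(-(w_sp - 40) / 5)) + rnorm(1000, sd = 0.1)

# Hypothetical outliers: 50 readings at high wind speed with zero power
bad <- sample(which(w_sp > 60), 50)
power[bad] <- 0

# Stay inside the observed range; loess does not extrapolate by default
x_grid <- seq(min(w_sp), max(w_sp), length.out = 100)

fit_ls  <- loess(power ~ w_sp)                        # default least-squares fit
fit_rob <- loess(power ~ w_sp, family = "symmetric")  # robust fit

pred_ls  <- predict(fit_ls,  data.frame(w_sp = x_grid))
pred_rob <- predict(fit_rob, data.frame(w_sp = x_grid))

plot(w_sp, power, col = "grey")
lines(x_grid, pred_ls,  col = "red",  lwd = 3)  # pulled down by the outliers
lines(x_grid, pred_rob, col = "blue", lwd = 3)  # largely ignores them
```

The span argument of loess() controls how local the fit is and is left at its default here.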

First we will create some example data to make the problem concrete:

w_sp = sample(seq(0, 100, 0.01), 1000)
power = 1/(1+exp(-(rnorm(1000, mean=w_sp, sd=5) -40)/5))

Suppose we want to bin the power values by wind speed into the ranges [0,5), [5,10), etc. Then:

bin_incr = 5
bins = seq(0, 95, bin_incr)
y_mean = sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x+bin_incr)]))

We have now computed the mean power within each range of interest. Note that if you want the median values instead, just change mean to median. All that's left to do is to plot them:

plot(w_sp, power)
points(seq(2.5, 97.5, 5), y_mean, col=3, pch=16)
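The same binning can also be written with cut() and tapply(), which some people find easier to read. This is only a sketch of an equivalent approach: the seed and the names `breaks`, `bin`, `y_mean_cut` and `bin_mid` are introduced here for illustration.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
w_sp  <- sample(seq(0, 100, 0.01), 1000)
power <- 1 / (1 + exp(-(rnorm(1000, mean = w_sp, sd = 5) - 40) / 5))

bin_incr <- 5
breaks   <- seq(0, 100, bin_incr)

# right = FALSE makes the intervals closed on the left: [0,5), [5,10), ...
bin <- cut(w_sp, breaks = breaks, right = FALSE)

# Mean power per bin, named by interval label
y_mean_cut <- tapply(power, bin, mean)
bin_mid    <- head(breaks, -1) + bin_incr / 2

# The sapply() version from above, for comparison
bins   <- seq(0, 95, bin_incr)
y_mean <- sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x + bin_incr)]))

plot(w_sp, power)
lines(bin_mid, as.numeric(y_mean_cut), type = "b", col = 3, pch = 16)
```

Both versions leave out a wind speed of exactly 100, because the last interval is [95,100).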

To get the average based on data that falls within two standard deviations of the average, we need to create a slightly more complicated function:

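noOutliers = function(x, power, w_sp, bin_incr) {
  # power values whose wind speed falls in [x, x + bin_incr)
  d = power[w_sp >= x & w_sp < (x + bin_incr)]
  m_d = mean(d)
  s_d = sd(d)
  # keep only the values within two standard deviations of the bin mean
  d_trim = d[d > (m_d - 2*s_d) & d < (m_d + 2*s_d)]
  return(mean(d_trim))
}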

y_no_outliers = sapply(bins, noOutliers, power, w_sp, bin_incr)
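To see how much the trimming changes the curve, both sets of bin means can be drawn as lines over the scatter plot. The sketch below is self-contained and uses a hypothetical helper, `trimmed_bin_mean`, with an `n_sd` argument so the cut-off can be set to one or two standard deviations; the helper name, the seed and the plotting choices are assumptions, not part of the answer above.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
w_sp  <- sample(seq(0, 100, 0.01), 1000)
power <- 1 / (1 + exp(-(rnorm(1000, mean = w_sp, sd = 5) - 40) / 5))

bin_incr <- 5
bins     <- seq(0, 95, bin_incr)

# Mean of the power values in [x, x + bin_incr) that lie within
# n_sd standard deviations of that bin's mean (hypothetical helper)
trimmed_bin_mean <- function(x, power, w_sp, bin_incr, n_sd = 2) {
  d    <- power[w_sp >= x & w_sp < (x + bin_incr)]
  keep <- abs(d - mean(d)) < n_sd * sd(d)
  mean(d[keep])
}

y_mean        <- sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x + bin_incr)]))
y_no_outliers <- sapply(bins, trimmed_bin_mean, power, w_sp, bin_incr)

bin_mid <- bins + bin_incr / 2
plot(w_sp, power, col = "grey")
lines(bin_mid, y_mean, col = 3, lwd = 2)
lines(bin_mid, y_no_outliers, col = 4, lwd = 2, lty = "dashed")
legend("topleft", legend = c("bin mean", "mean within 2 sd"),
       col = c(3, 4), lwd = 2, lty = c("solid", "dashed"))
```

A bin with fewer than two points has no standard deviation, so the helper would return NA there; with 1000 points spread over 20 bins that should not come up.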

Here are some examples of fitted curves (Weibull analysis) for commercial turbines:

http://www.inl.gov/wind/software/

http://www.irec.cmerp.net/papers/WOE/Paper%20ID%20161.pdf

http://www.icaen.uiowa.edu/~ie_155/Lecture/Power_Curve.pdf

I'd also recommend playing around with Hadley's own ggplot2. His website is a great resource: http://had.co.nz/ggplot2/

    # If you haven't already installed ggplot2:
    install.packages("ggplot2", dependencies = TRUE)

    # Load the ggplot2 package
    library(ggplot2)

    # csgillespie's example data
    w_sp <- sample(seq(0, 100, 0.01), 1000)
    power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)

    # Bind the two variables into a data frame, which ggplot prefers
    wind <- data.frame(w_sp = w_sp, power = power)

    # Take a look at how the first few rows look, just for fun
    head(wind)


    # Create a simple plot
    ggplot(data = wind, aes(x = w_sp, y = power)) + geom_point() + geom_smooth()

    # Create a slightly more complicated plot as an example of how to fine tune
    # plots in ggplot
    p1 <- ggplot(data = wind, aes(x = w_sp, y = power))
    p2 <- p1 + geom_point(colour = "darkblue", size = 1, shape = 20)  # 20 = small filled dot
    p3 <- p2 + geom_smooth(method = "loess", se = TRUE, colour = "purple")
    p3 + scale_x_continuous(name = "mph") + 
             scale_y_continuous(name = "power") +
             ggtitle("Wind speed and power")
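If you want the binned averages from the question rather than a smoother, one possible way is to compute them first and then add them as a second layer. This is a minimal sketch, assuming 5-unit bins; the seed and the names `bin_mid` and `bin_means` are introduced here for illustration.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
w_sp  <- sample(seq(0, 100, 0.01), 1000)
power <- 1 / (1 + exp(-(w_sp - 40) / 5)) + rnorm(1000, sd = 0.1)
wind  <- data.frame(w_sp = w_sp, power = power)

bin_incr <- 5
# Midpoint of the 5-unit bin [0,5), [5,10), ... each observation falls in
wind$bin_mid <- floor(wind$w_sp / bin_incr) * bin_incr + bin_incr / 2

# Mean power per bin
bin_means <- aggregate(power ~ bin_mid, data = wind, FUN = mean)

# Raw data in grey, binned means as a purple line with points
p <- ggplot(wind, aes(x = w_sp, y = power)) +
  geom_point(colour = "grey60", size = 1) +
  geom_line(data = bin_means, aes(x = bin_mid, y = power), colour = "purple") +
  geom_point(data = bin_means, aes(x = bin_mid, y = power),
             colour = "purple", size = 2)
print(p)
```

Note that a wind speed of exactly 100 would land in a bin of its own, [100,105), with this floor-based binning.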

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow