R语言-将数据分类为范围；平均忽略异常值

https://stackoverflow.com/questions/4843194

r
outliers

27-10-2019
|

题

我正在分析来自风力涡轮机的数据，通常这是我在excel中要做的事情，但是数据量需要一些繁重的工作。我以前从未使用过R，所以我只是在寻找一些指针。

数据由两列 WindSpeed 和 Power 组成，到目前为止，我已经从CSV文件导入数据，并且将两者相互散布了。

我接下来要做的是将数据分类为范围；例如， WindSpeed 在x和y之间的所有数据，然后找到每个范围产生的平均功率并绘制曲线。

根据这个平均值，我想根据落在平均值的两个标准偏差之一内的数据重新计算平均值（基本上忽略了离群值）。

任何指针都值得赞赏。

对于有兴趣的人，我尝试创建类似于此。它是一种非常标准的图形类型，但是就像我说的那样，数据的剪切量需要比excel重的东西。

解决方案

将此动机与@hadley的动机类似，使用加性模型，使用包mgcv的自适应平滑器将其加入混合：

第一个虚拟数据，如@hadley所使用通用标签

使用gam()拟合加性模型，并通过REML进行自适应平滑和平滑选择通用标签

根据我们的模型进行预测并获得拟合的标准误差，使用后者可以生成大约95％的置信区间通用标签

绘制一切，黄土适合比较通用标签

“自适应光滑和黄土拟合”

其他提示

既然您不再使用Excel，为什么不使用不需要统计数据粗化和临时方法消除异常值的现代统计方法：由黄土实现的局部平滑回归。

对csgillespie的示例数据进行一些修改：通用标签

First we will create some example data to make the problem concrete:

w_sp = sample(seq(0, 100, 0.01), 1000)
power = 1/(1+exp(-(rnorm(1000, mean=w_sp, sd=5) -40)/5))

Suppose we want to bin the power values between [0,5), [5,10), etc. Then

bin_incr = 5
bins = seq(0, 95, bin_incr)
y_mean = sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x+bin_incr)]))

We have now created the mean values between the ranges of interest. Note, if you wanted the median values, just change mean to median. All that's left to do, is to plot them:

plot(w_sp, power)
points(seq(2.5, 97.5, 5), y_mean, col=3, pch=16)

To get the average based on data that falls within two standard deviations of the average, we need to create a slightly more complicated function:

noOutliers = function(x, power, w_sp, bin_incr) {
  d = power[w_sp >= x & w_sp < (x + bin_incr)]
  m_d = mean(d)
  d_trim = mean(d[d > (m_d - 2*sd(d)) & (d < m_d + 2*sd(d))])
  return(mean(d_trim))
}

y_no_outliers = sapply(bins, noOutliers, power, w_sp, bin_incr)

Here are some examples of fitted curves (weibull analysis) for commercial turbines:

http://www.inl.gov/wind/software/

http://www.irec.cmerp.net/papers/WOE/Paper%20ID%20161.pdf

http://www.icaen.uiowa.edu/~ie_155/Lecture/Power_Curve.pdf

I'd recommend also playing around with Hadley's own ggplot2. His website is a great resource: http://had.co.nz/ggplot2/ .

    # If you haven't already installed ggplot2:
    install.pacakges("ggplot2", dependencies = T)

    # Load the ggplot2 package
    require(ggplot2)

    # csgillespie's example data
    w_sp <- sample(seq(0, 100, 0.01), 1000)
    power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)

    # Bind the two variables into a data frame, which ggplot prefers
    wind <- data.frame(w_sp = w_sp, power = power)

    # Take a look at how the first few rows look, just for fun
    head(wind)


    # Create a simple plot
    ggplot(data = wind, aes(x = w_sp, y = power)) + geom_point() + geom_smooth()

    # Create a slightly more complicated plot as an example of how to fine tune
    # plots in ggplot
    p1 <- ggplot(data = wind, aes(x = w_sp, y = power))
    p2 <- p1 + geom_point(colour = "darkblue", size = 1, shape = "dot") 
    p3 <- p2 + geom_smooth(method = "loess", se = TRUE, colour = "purple")
    p3 + scale_x_continuous(name = "mph") + 
             scale_y_continuous(name = "power") +
             opts(title = "Wind speed and power")

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow