Residuals calculation in series with few points (max 20)

https://stackoverflow.com/questions/17840593

r
loess

04-06-2022
|

Question

I'm using loess to calculate the residuals. I expect for the following (small series) to find a big value of the residual for the third point

    y <- c(5814, 6083, 17764, 6110, 6556)
    x <- c(14564, 14719, 14753, 14754, 15086)
    > residuals(loess(y ~ x))
            1             2             3             4             5 
 2.728484e-12 -9.094947e-13  3.637979e-12  3.637979e-12  0.000000e+00

In particular, loess gives the following output:

> loess(y ~ x)
Call:
loess(formula = y ~ x)

Number of Observations: 5 
Equivalent Number of Parameters: 5 
Residual Standard Error: Inf 
Warning messages:
1: In simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize,  :
  span too small.   fewer data values than degrees of freedom.
2: In simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize,  :
  pseudoinverse used at 14561
3: In simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize,  :
  neighborhood radius 191.61
4: In simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize,  :
  reciprocal condition number  0
5: In simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize,  :
  There are other near singularities as well. 1.1263e+005

There is probably a (very simple) reason that I'm missing now, but the above seems strange to me... why it "doesn't work" as expected in my case?

EDIT:

thanks to @Gavin Simpson who suggested me this link I found out in the package MASS the function rlm which gives exactly what I was hoping.In the mean time I also tried to use lowess with several iteration and its fitted values converged actually "better" (in this case) to my data:

library(MASS)
method_rlm <- rlm(x=x,y=y)
method_lowess <- lowess(x,y, iter=7, f=1)

df<-data.frame(x=x, y=y, rlm=method_rlm$fitted.values, lowess=method_lowess$y)

library(ggplot2)
ggplot(df) +
  geom_line(aes(x, y), color="red") +
  geom_line(aes(x, rlm), color="blue") +
  geom_line(aes(x, lowess), color="green") +
  geom_point(aes(x, y), color="red")

enter image description here

I also had a look to some timings and the difference is huge..

> microbenchmark(rlm(x=x,y=y), lowess(x,y, iter=7, f=1), times=1000)
Unit: microseconds
                          expr      min       lq    median        uq        max neval
             rlm(x = x, y = y) 6445.269 6663.972 6906.1350 9417.1895 271494.006  1000
 lowess(x, y, iter = 7, f = 1)  169.099  193.046  238.0085  273.9295   3900.493  1000

do you think this difference will be worth? I have million of small series like that ( with 5 to 20 points maximum and similar type of outliers)

Solution

There are 5 observations in the data and loess() is fitting a model with 5 degrees of freedom, hence it is able to fit the observed data perfectly and thence the small (effectively 0) residuals. loess() has sufficient freedom to interpolate your data exactly, but is not a useful summary of the data. Fit a simpler model.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow