Question

I am trying to recreate the procedure which is used for estimating smooth function from stat_smooth in ggplot2 package. Lets take an example:

library(ggplot2)

n <- 100
X <- runif(n)*8
Y <- sin(3*X) + cos(X^2) + rnorm(n, 0, 0.5)

myData <- as.data.frame(cbind(X, Y))

p <- ggplot(myData, aes(y=Y, x=X)) + 
     stat_smooth(se = FALSE, size = 2) +
     geom_point(size = 1)
p
geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

enter image description here

The smooth line doesn't really fit the data, but it doesn't matter. Now, lets recreate the same graph from scratch. According to http://www.inside-r.org/r-doc/stats/loess we need to use tricubic weighting kernel and polynomial of degree 2 (by default). I found this http://www.maths.manchester.ac.uk/~peterf/MATH38011/NPR%20local%20Linear%20Estimator.pdf article which describes how to estimate smooth loess function. I try to recreate this method and use it on my data:

Dfct <- function(t){
  if (abs(t) <= 1)
  ((1-abs(t)^3)^3) else
  0
  }

K_h <- function(x_0, x){
  f_hat <- NULL
  Dfct(abs(x - x_0)/h)
  }

m_hat_loess <- function(X, Y){
  e_1 <- c(1, 0, 0)
  m_hat <- NULL

  for(i in 1:length(X)){
  K_h_vector <- NULL

  for(j in 1:length(X)){
    K_h_vector <- c(K_h_vector, K_h(X[i], X[j]))
    }

  X_0 <- cbind(rep(1, length(X)), (X - X[i]), (X - X[i])^2)
  W <- diag(K_h_vector)

  m_hat <- c(m_hat,
  t(e_1)%*% solve(t(X_0)%*%W%*%X_0) %*% (t(X_0)%*%W%*%Y)
  )
}
m_hat
}

I am not sure what I should use for parameter h, but according to a book I have "For tri-cube kernel with metric width, h is the radius of the support region." Hence the first thing I try is:

h <- (max(X)-min(X))/2
Y_hat <- m_hat_loess(X, Y)

tempData <- as.data.frame(cbind(X, Y_hat))
ggplot(tempData , aes(x=X, y=Y_hat)) +
  geom_line(size = 2)

enter image description here

This is clearly not the same function. I have been using different values of h but couldn't recreate the same curve, which makes me believe that I made a mistake somewhere else.

Was it helpful?

Solution

The stat_smooth(...) function in the ggplot package merely passes your data (potentially subsetted) to the loess(...) function, as can be demonstrated here:

library(ggplot2)
set.seed(1)
n <- 100
X <- runif(n)*8
Y <- sin(3*X) + cos(X^2) + rnorm(n, 0, 0.5)

myData <- data.frame(X,Y)
fit <- loess(Y~X,data=myData)
myData$pred <- predict(fit)

ggplot(myData, aes(X,Y))+
  geom_point()+
  stat_smooth(se=F, size=3)+
  geom_line(aes(X,pred),colour="yellow")

The documentation for loess(...) provides references to the method of calculation, specifically, here.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top