Frage

I need to do some robust data-fitting operation.

I have bunch of (x,y) data, that I want to fit to a Gaussian (aka normal) function. The point is, I want to remove the ouliers. As one can see on the sample plot below, there is another distribution of data thats pollutting my data on the right, and I don't want to take it into account to do the fitting (i.e. to find \sigma, \mu and the overall scale parameter). sample data plot

R seems to be the right tool for the job, I found some packages (robust, robustbase, MASS for example) that are related to robust fitting.

However, they assume the user already has a strong knowledge of R, which is not my case, and the documentation is only provided as a sort of reference manual, no tutorial or equivalent. My statistical background is rather low, I attempted to read reference material on fitting with R, but it didn't really help (and I'm not even sure thats the right way to go). But I have the feeling that this is actually a quite simple operation.

I have checked this related question (and the linked ones), however they take as input a single vector of values, and I have a vector of pairs, so I don't see how to transpose.

Any help on how to do this would be appreciated.

War es hilfreich?

Lösung

Fitting a Gaussian curve to the data, the principle is to minimise the sum of squares difference between the fitted curve and the data, so we define f our objective function and run optim on it:

fitG =
function(x,y,mu,sig,scale){

  f = function(p){
    d = p[3]*dnorm(x,mean=p[1],sd=p[2])
    sum((d-y)^2)
  }

  optim(c(mu,sig,scale),f)
 }

Now, extend this to two Gaussians:

fit2G <- function(x,y,mu1,sig1,scale1,mu2,sig2,scale2,...){

  f = function(p){
    d = p[3]*dnorm(x,mean=p[1],sd=p[2]) + p[6]*dnorm(x,mean=p[4],sd=p[5])
    sum((d-y)^2)
  }
  optim(c(mu1,sig1,scale1,mu2,sig2,scale2),f,...)
}

Fit with initial params from the first fit, and an eyeballed guess of the second peak. Need to increase the max iterations:

> fit2P = fit2G(data$V3,data$V6,6,.6,.02,8.3,0.10,.002,control=list(maxit=10000))
Warning messages:
1: In dnorm(x, mean = p[1], sd = p[2]) : NaNs produced
2: In dnorm(x, mean = p[4], sd = p[5]) : NaNs produced
3: In dnorm(x, mean = p[4], sd = p[5]) : NaNs produced
> fit2P
$par
[1] 6.035610393 0.653149616 0.023744876 8.317215066 0.107767881 0.002055287

What does this all look like?

> plot(data$V3,data$V6)
> p = fit2P$par
> lines(data$V3,p[3]*dnorm(data$V3,p[1],p[2]))
> lines(data$V3,p[6]*dnorm(data$V3,p[4],p[5]),col=2)

enter image description here

However I would be wary about statistical inference about your function parameters...

The warning messages produced are probably due to the sd parameter going negative. You can fix this and also get a quicker convergence by using L-BFGS-B and setting a lower bound:

> fit2P = fit2G(data$V3,data$V6,6,.6,.02,8.3,0.10,.002,control=list(maxit=10000),method="L-BFGS-B",lower=c(0,0,0,0,0,0))
> fit2P
$par
[1] 6.03564202 0.65302676 0.02374196 8.31424025 0.11117534 0.00208724

As pointed out, sensitivity to initial values is always a problem with curve fitting things like this.

Andere Tipps

Fitting a Gaussian:

# your data
set.seed(0)
data <- c(rnorm(100,0,1), 10, 11) 

# find & remove outliers
outliers <- boxplot(data)$out
data <- setdiff(data, outliers)

# fitting a Gaussian
mu <- mean(data)
sigma <- sd(data)

# testing the fit, check the p-value
reference.data <- rnorm(length(data), mu, sigma)
ks.test(reference.data, data) 
Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top