Question

I have an mle2 model that I've developed here just to demonstrate the problem. I generate values from two separate Gaussian distributions, x1 and x2, combine them into x=c(x1,x2), and then build an MLE that attempts to re-classify each x value as lying to the left or to the right of a split point, given by the xsplit parameter.

The problem is that the fitted parameters are not ideal. Specifically, xsplit is always returned as whatever its starting value was, and changing that starting value (e.g., to 4 or to 9) produces huge differences in the resulting log-likelihood.

Here is the completely reproducible example:

set.seed(1001)
library(bbmle)
x1 = rnorm(n=100,mean=4,sd=0.8)
x2 = rnorm(n=100,mean=12,sd=0.4)
x = c(x1,x2)
hist(x,breaks=20)
ff = function(m1,m2,sd1,sd2,xsplit) {
  outs = rep(NA,length(xvals))
  for(i in seq(1,length(xvals))) {
    if(xvals[i]<=xsplit) {
      outs[i] = dnorm(xvals[i],mean=m1,sd=sd1,log=T)
    }
    else {
      outs[i] = dnorm(xvals[i],mean=m2,sd=sd2,log=T)
    }
  }
  -sum(outs)
}

# change xsplit starting value here to 9 and 4
# and realize the difference in log likelihood
# Why isn't mle finding the right value for xsplit?
mo = mle2(ff,
          start=list(m1=1,m2=2,sd1=0.1,sd2=0.1,xsplit=9), 
          data=list(xvals=x))

#print mo to see log likelihood value
mo

#plot the result
c=coef(mo)
m1=as.numeric(c[1])
m2=as.numeric(c[2])
sd1=as.numeric(c[3])
sd2=as.numeric(c[4])
xsplit=as.numeric(c[5])
leftx = x[x<xsplit]
rightx = x[x>=xsplit]
y1=dnorm(leftx,mean=m1,sd=sd1)
y2=dnorm(rightx,mean=m2,sd=sd2)
points(leftx,y1*40,pch=20,cex=1.5,col="blue")
points(rightx,y2*90,pch=20,cex=1.5,col="red")

How can I modify my mle2 call so that it recovers the correct parameters, in particular xsplit?

Solution

Mixture models present a lot of technical challenges (symmetry under relabeling of components, etc.); unless you have very specific needs, you might be better off using one of the large number of special-purpose mixture modeling packages that have been written for R (just library("sos"); findFn("{mixture model}") or findFn("{mixture model} Gaussian")).

However, in this case you have a more specific problem: the goodness-of-fit/likelihood surface for the xsplit parameter is "bad", in the sense that its derivative is zero almost everywhere. In particular, if you consider a pair of neighbouring points x1, x2 in your data set, the likelihood is exactly the same for any splitting parameter between x1 and x2, because every such value splits the data into the same two groups. That means the likelihood surface is piecewise flat, which defeats essentially any sensible optimizer, even derivative-free methods such as Nelder-Mead. Your choices are:

(1) use some sort of brute-force stochastic optimizer, such as method="SANN" in optim() (see the sketch just below);
(2) take xsplit out of your function and profile over it, i.e. for each candidate value of xsplit, optimize over the other four parameters;
(3) smooth your splitting criterion, i.e. fit a logistic probability of belonging to one component or the other (this is what the "Update: smoothing" section below does);
(4) use a special-purpose mixture-model fitting algorithm, as recommended above.
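For example, option (1) can be tried directly on the ff and x already defined in the question. This is only a rough sketch of mine, not something I'd necessarily recommend: SANN is a stochastic search, so the answer varies from run to run, and you may still get warnings about the Hessian because the surface is flat.

## option (1), roughly: a brute-force stochastic optimizer that doesn't
## rely on a smooth likelihood surface (sketch only; slow, seed-dependent)
mo_sann <- mle2(ff,
                start=list(m1=1,m2=2,sd1=0.1,sd2=0.1,xsplit=9),
                data=list(xvals=x),
                method="SANN")
coef(mo_sann)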

set.seed(1001)
library(bbmle)
x1 = rnorm(n=100,mean=4,sd=0.8)
x2 = rnorm(n=100,mean=12,sd=0.4)
x = c(x1,x2)

Your ff function can be written more compactly:

## ff can be written more compactly:
ff2 <- function(m1,m2,sd1,sd2,xsplit) {
    p <- xvals<=xsplit
    -sum(dnorm(xvals,mean=ifelse(p,m1,m2),
               sd=ifelse(p,sd1,sd2),log=TRUE))
}
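As a quick sanity check (my addition, not part of the original answer), the compact version returns the same value as your original ff; both look up xvals in the workspace when called directly rather than through mle2's data argument:

xvals <- x              ## needed when calling the functions outside mle2
ff(1,2,0.1,0.1,9)       ## original version
ff2(1,2,0.1,0.1,9)      ## compact version; the two values should agree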

## ML estimation
mo <- mle2(ff2,
           start=list(m1=1,m2=2,sd1=0.1,sd2=0.1,xsplit=9), 
           data=list(xvals=x))

## refit with a different starting value for xsplit
mo2 <- update(mo,start=list(m1=1,m2=2,sd1=0.1,sd2=0.1,xsplit=4))
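Comparing the two fits shows the problem described in the question: xsplit stays at its starting value, so the two log-likelihoods differ substantially (exact values will depend on your session):

logLik(mo)
logLik(mo2)
coef(mo)["xsplit"]   ## still 9, where it started
coef(mo2)["xsplit"]  ## still 4, where it started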

## not used here, but maybe handy
plotfun <- function(mo,xvals=x,sizes=c(40,90)) {
    c <- coef(mo)
    hist(xvals,col="gray")
    p <- xvals <= c["xsplit"]
    y <- with(as.list(coef(mo)),
              dnorm(xvals,mean=ifelse(p,m1,m2),
                    sd=ifelse(p,sd1,sd2))*sizes[ifelse(p,1,2)])
    points(xvals,y,pch=20,cex=1.5,col=c("blue","red")[ifelse(p,1,2)])
}
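For completeness, a possible usage (my addition; the original answer defines plotfun but does not call it):

plotfun(mo2)   ## histogram of x with the fitted densities from mo2 overlaid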

plot(slice(mo),ylim=c(-0.5,10))
plot(slice(mo2),ylim=c(-0.5,10))

I cheated a little bit to extract just the xsplit parameter:

Likelihood surface around xsplit=9:

[figure: slice plot from the fit started at xsplit=9]

Likelihood surface around xsplit=4:

[figure: slice plot from the fit started at xsplit=4]
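If you want to look at the xsplit slice by hand, one option (my own sketch, not the code used for the figures above) is to fix the other parameters at their fitted values and evaluate the negative log-likelihood over a grid of xsplit values; the flat steps between neighbouring data points are clearly visible:

## hand-rolled slice over xsplit: hold the other parameters at their
## fitted values and evaluate the negative log-likelihood on a grid
## (grid limits chosen arbitrarily)
xvals <- x
cc <- coef(mo)
ss <- seq(2,14,length.out=201)
nll <- sapply(ss, function(s) ff2(cc["m1"],cc["m2"],cc["sd1"],cc["sd2"],s))
plot(ss, nll-min(nll), type="l",
     xlab="xsplit", ylab="negative log-likelihood (shifted)")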

Also see p. 243 of Bolker 2008.

Update: smoothing

As I mentioned above, one solution is to make the boundary between the two mixture components smooth, or gradual, rather than sharp. I used a logistic function, plogis(), with its midpoint at xsplit and a slope coefficient arbitrarily set to 2 (you could try to make it steeper; in principle you could make it an adjustable parameter, but if you do you'll probably run into trouble again, because the optimizer may want to make it infinite ...). In other words, rather than saying that all observations with x<xsplit are definitely in component 1 and all observations with x>xsplit are definitely in component 2, we say that an observation equal to xsplit has a 50/50 probability of falling in either component, with the certainty of being in component 1 increasing as x decreases below xsplit. A logistic function with a very steep slope approximates the sharp-split model attempted previously; in general you want the slope large enough to give a reasonably sharp split but small enough to avoid numeric problems. (If you make it too steep, the computed probabilities will underflow/overflow to 0 or 1 and you'll be back where you started ...)
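To see what this weighting does, here are the component-2 membership probabilities for a few x values around a hypothetical split at 8 (my illustration; the factor of 2 matches the slope used in the code below):

xx <- c(4,7,8,9,12)
round(plogis(2*(xx-8)),3)   ## ~0 well below the split, 0.5 at it, ~1 well above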

This is my second or third try; I had to do a fair amount of fiddling (bounding values away from 0, or between 0 and 1, and fitting the standard deviations on the log scale), but the results seem reasonable. If I don't use clamp() on the logistic (plogis) probabilities, they collapse to exactly 0 or 1; if I don't use clamp() (one-sided) on the Normal densities, they can underflow to zero. In either case I end up with infinite or NaN values for the negative log-likelihood. Fitting the standard deviations on the log scale also helps, because the optimizer can never wander into negative values for a standard deviation ...

## bound x values between lwr and upr
clamp <- function(x,lwr=0.001,upr=0.999) {
    pmin(upr,pmax(lwr,x))
}

## negative log-likelihood of the smoothed (logistic-weighted) mixture
ff3 <- function(m1,m2,logsd1,logsd2,xsplit) {
    p <- clamp(plogis(2*(xvals-xsplit)))
    -sum(log((1-p)*clamp(dnorm(xvals,m1,exp(logsd1)),upr=Inf)+
                 p*clamp(dnorm(xvals,m2,exp(logsd2)),upr=Inf)))
}
xvals <- x
ff3(1,2,0.1,0.1,4)   ## sanity check: should give a finite value
mo3 <- mle2(ff3,
          start=list(m1=1,m2=2,logsd1=-1,logsd2=-1,xsplit=4), 
          data=list(xvals=x))
## Coefficients:
##          m1          m2      logsd1      logsd2      xsplit 
##  3.99915532 12.00242510 -0.09344953 -1.13971551  8.43767997 

The results look reasonable.
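As a quick follow-up check (my addition), the fitted log standard deviations can be back-transformed and compared with the values used to simulate the data:

exp(coef(mo3)[c("logsd1","logsd2")])   ## compare with the true sds 0.8 and 0.4
coef(mo3)[c("m1","m2")]                ## compare with the true means 4 and 12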

Licensed under: CC-BY-SA with attribution