Using MATLAB to calculate the area under the first peak

Question 1

Initially a Gaussian mixture approach looked really promising. The issue is that, as well as the data being noisy, the signal source actually falls into a couple of distinct cases, so I'd often find that a combination of Gaussians which worked on one data set would fail (drastically) on another.

Getting around this was possible, but the more general solutions introduced drift/bias into the approximations, and its impact was inconsistent, depending on both the noise and the underlying case.

After faffing with this for a while, I opted to try MATLAB's curvedspline instead. This ended up being a much better approach, which I then combined with some multidimensional cluster analysis to pick out places where the spline fitting had clearly gone awry. This meant that, rather than fitting to bad data (i.e. data which deviated seriously from the bulk of the data), I was able to discard these outliers. Specifically, I used domain knowledge to identify cases where, by definition, outliers were the result of a poor fit and not of sample variance. This actually only led to a couple of data points per sample being discarded (1-2 out of 20) and gave pretty clean results in the end.

Question 2

If you have the original data, work with a mixture of Gaussians instead of a histogram as your density approximation. The estimated density is then a smooth function (a linear combination of Gaussian densities), so you can easily find its stationary points and compute the mass on any given interval. A simple and easily programmed method for computing the mixture parameters is the so-called EM (expectation-maximization) algorithm. Searching for "mixture of Gaussians" and/or "EM" should turn up a lot of hits, and perhaps working MATLAB code as well.
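
To make the idea concrete, here is a minimal sketch of EM for a one-dimensional mixture, written in plain Python rather than MATLAB so that every step is visible. Everything in it is an assumption made for illustration: the synthetic data, the two-component choice, the crude initialisation, and the helper names (fit_gmm_1d, mixture_mass, valley_between). It fits the mixture, finds the density minimum between the two component means on a grid, and takes the mixture mass to the left of that minimum as the area under the first peak.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def fit_gmm_1d(data, k=2, iterations=100):
    """Fit a k-component 1-D Gaussian mixture with plain EM.

    Returns (weights, means, sigmas). The initial means are spread over
    the sorted data; this is a crude, deterministic starting point.
    """
    xs = sorted(data)
    n = len(xs)
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    sigmas = [(xs[-1] - xs[0]) / (2.0 * k)] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, s)
                 for w, m, s in zip(weights, means, sigmas)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, mean and standard deviation.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, xs)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-12))
    return weights, means, sigmas


def mixture_pdf(x, weights, means, sigmas):
    """Density of the fitted mixture at x."""
    return sum(w * normal_pdf(x, m, s)
               for w, m, s in zip(weights, means, sigmas))


def mixture_mass(a, b, weights, means, sigmas):
    """Probability mass of the fitted mixture on the interval [a, b]."""
    return sum(w * (normal_cdf(b, m, s) - normal_cdf(a, m, s))
               for w, m, s in zip(weights, means, sigmas))


def valley_between(lo, hi, weights, means, sigmas, steps=1000):
    """Grid search for the density minimum between lo and hi."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda x: mixture_pdf(x, weights, means, sigmas))


# Synthetic two-peak data (an assumption, not the asker's data).
random.seed(0)
data = ([random.gauss(2.0, 0.5) for _ in range(600)]
        + [random.gauss(6.0, 1.0) for _ in range(400)])

weights, means, sigmas = fit_gmm_1d(data, k=2)
valley = valley_between(min(means), max(means), weights, means, sigmas)
first_peak_area = mixture_mass(-math.inf, valley, weights, means, sigmas)
print(weights, means, sigmas, valley, first_peak_area)
```

The area here is a fraction of the total probability mass; multiply it by the number of samples (or the total signal) to get it in the units of the original histogram. A real implementation would also guard against a component collapsing onto a single point and would try several initialisations.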

If you don't have the original data, I have some other ideas.

Question 3

@Robert Dodier is correct, but does not seem to know about the gmdistribution built-in for MATLAB [1].

If you fit a Gaussian mixture to your data, then all you need to do is determine which component has the largest weight, and read off the mean and variance of that component.
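
The selection step can be sketched in a few lines. This is plain Python with made-up parameter values standing in for whatever the fit returns; the function names are hypothetical. Because each Gaussian density integrates to 1, the area under a component of the fitted density is simply its weight. The sketch also shows the lowest-mean component, since the first peak along the axis need not be the heaviest one.

```python
def dominant_component(weights, means, variances):
    """Return (index, weight, mean, variance) of the heaviest component."""
    j = max(range(len(weights)), key=lambda i: weights[i])
    return j, weights[j], means[j], variances[j]


def leftmost_component(weights, means, variances):
    """Return (index, weight, mean, variance) of the lowest-mean component."""
    j = min(range(len(means)), key=lambda i: means[i])
    return j, weights[j], means[j], variances[j]


# Hypothetical fitted parameters, for illustration only.
weights = [0.25, 0.60, 0.15]
means = [1.0, 4.0, 9.0]
variances = [0.3, 1.2, 2.0]

heaviest = dominant_component(weights, means, variances)
first = leftmost_component(weights, means, variances)

# The weight is the component's share of the total area under the density.
print("heaviest component:", heaviest)
print("leftmost component:", first)
```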

The spline smoother has a bias problem. It can also give non-physical results such as negative probability density. The GMM has a better "basis".

Personally, I like using ecdf [2] and fitting the analytic form of the CDF [3]. This gives me optimal binning (and in images I can get a huge increase in compute speed) and reduces the effects of centered (zero-mean) noise.
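
As a rough illustration of the CDF-fitting idea, the sketch below builds the empirical CDF of some samples and fits a single normal CDF to it by least squares. It is plain Python, not the MATLAB workflow from the links: the data are synthetic, a single Gaussian stands in for whatever analytic form suits your signal, and the shrinking-step search (fit_normal_cdf, a hypothetical helper) is a deliberately crude replacement for a proper nonlinear least-squares routine.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def empirical_cdf(data):
    """Return sorted sample values and the ECDF evaluated at each of them."""
    xs = sorted(data)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def fit_normal_cdf(xs, f, tol=1e-4):
    """Least-squares fit of a normal CDF to ECDF points (xs, f).

    Uses a simple pattern search: try a step in each parameter, keep any
    improvement, and halve the step when nothing improves.
    """
    def sse(mu, sigma):
        return sum((normal_cdf(x, mu, sigma) - fi) ** 2
                   for x, fi in zip(xs, f))

    mu = xs[len(xs) // 2]             # start at the sample median
    sigma = (xs[-1] - xs[0]) / 4.0    # rough starting scale
    step = sigma / 2.0
    best = sse(mu, sigma)
    while step > tol:
        improved = False
        for dmu, dsigma in ((step, 0), (-step, 0), (0, step), (0, -step)):
            cand_mu, cand_sigma = mu + dmu, sigma + dsigma
            if cand_sigma <= 0:
                continue
            value = sse(cand_mu, cand_sigma)
            if value < best:
                mu, sigma, best = cand_mu, cand_sigma, value
                improved = True
        if not improved:
            step /= 2.0
    return mu, sigma


# Synthetic single-peak data (an assumption, for illustration only).
random.seed(1)
samples = [random.gauss(3.0, 0.8) for _ in range(1000)]

xs, f = empirical_cdf(samples)
mu_hat, sigma_hat = fit_normal_cdf(xs, f)
print(mu_hat, sigma_hat)
```

No bin width is chosen anywhere: the fit works directly on the sorted samples, which is the appeal of going through the CDF. For a multi-peak signal the same objective can be written with a weighted sum of normal CDFs.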

[1] http://www.mathworks.com/help/stats/gmdistribution.fit.html
[2] http://www.mathworks.com/help/stats/ecdf.html
[3] http://www.mathworks.com/help/curvefit/custom-nonlinear-models.html