Question

I'm looking for a non-linear curve fitting routine (most likely found in R or Python, but I'm open to other languages) which would take x,y data and fit a curve to it.

I should be able to specify as a string the type of expression I want to fit.

Examples:

"A+B*x+C*x*x"
"(A+B*x+C*x*x)/(D*x+E*x*x)"
"sin(A+B*x)*exp(C+D*x)+E+F*x"

What I would get out of this is at least the values of the constants (A, B, C, etc.), and hopefully some statistics about the goodness of fit.

There are commercial programs that do this, but I expected something as common as fitting a desired expression to be available in a language library by now. I suspect SciPy's optimization routines might be able to do this, but I can't see how to define an equation with them. Likewise, I can't seem to find exactly what I want in R.

Is what I'm looking for out there, or do I need to roll my own? I hate to do it if it's there and I'm just having trouble finding it.


Edit: I want to do this to get more control over the process than LAB Fit gives me; the LAB Fit UI is dreadful. I'd also like to be able to break the range into multiple pieces and fit a different curve to each piece. In the end, the result has to beat a LUT with linear interpolation on speed, or I'm not interested.

In my current set of problems, I have trig functions or exp() and I need to execute them 352,800 times per second in real time (and use only a fraction of the CPU). So I plot the curve and use the data to drive the curve fitter to get less expensive approximations. In the old days, LUTs were almost always the solution, but nowadays skipping the memory lookups and doing an approximation is sometimes faster.


Solution

To answer your question in a general sense (regarding parameter estimation in R), without getting into the specifics of the equations you listed, I think you are looking for nls() or optim(). nls() is my first choice because it provides error estimates for each estimated parameter; when it fails, I fall back to optim(). If you have your x,y variables:

# Try nls() first; if it fails (e.g. a singular gradient), fall back to
# minimizing the sum of squared residuals with optim()
out <- tryCatch(nls( y ~ A + B*x + C*x*x, data = data.frame(x, y),
                start = c(A = 0, B = 1, C = 1) ),
                error = function(e)
                optim( c(A = 0, B = 1, C = 1), function(p, x, y)
                      sum((y - with(as.list(p), A + B*x + C*x^2))^2), x = x, y = y) )

to get the coefficients, something like

getcoef <- function(x) if (inherits(x, "nls")) coef(x) else x$par
getcoef(out)

If you want the standard errors in the case of 'nls',

summary(out)$parameters
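For a quick end-to-end check, here is the same pattern run on made-up data, along with a few goodness-of-fit extractors the question asked about (the data and values are purely illustrative):

# Made-up data roughly following the first model in the question
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- 2 + 0.5 * x - 0.1 * x^2 + rnorm(length(x), sd = 0.2)

out <- nls(y ~ A + B*x + C*x*x, data = data.frame(x, y),
           start = c(A = 0, B = 1, C = 1))
coef(out)                # estimated A, B, C
sum(residuals(out)^2)    # residual sum of squares
summary(out)$sigma       # residual standard error
AIC(out)                 # handy for comparing candidate expressions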

The help files and r-help mailing list posts contain many discussions of the specific minimization algorithms implemented by each function (the defaults were used in the example above) and their appropriateness for the particular form of the equation at hand. Certain algorithms can handle box constraints, and another function, constrOptim(), will handle a set of linear constraints. This CRAN task view may also help:

http://cran.r-project.org/web/views/Optimization.html
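For instance, here is a minimal sketch of constrOptim() using the same least-squares objective as above with a single linear constraint A >= 0 (reusing the x, y vectors from the check above; starting values are guesses):

sse <- function(p) sum((y - (p[1] + p[2] * x + p[3] * x^2))^2)
constrOptim(theta = c(0.1, 1, 1),      # must start strictly inside A > 0
            f = sse, grad = NULL,      # grad = NULL selects Nelder-Mead
            ui = matrix(c(1, 0, 0), nrow = 1),  # this row picks out A
            ci = 0)                    # enforces A - 0 >= 0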

OTHER TIPS

Your first model is actually linear in the three parameters and can be fit in R using

 fit <- lm(y ~ x + I(x^2), data=X)

which will get you your three parameters.
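For instance, with made-up data (the data frame X and its column names here are assumptions for illustration):

X <- data.frame(x = seq(-5, 5, by = 0.1))
X$y <- 1 + 2 * X$x + 3 * X$x^2 + rnorm(nrow(X))   # illustrative values
fit <- lm(y ~ x + I(x^2), data = X)
coef(fit)               # A (intercept), B, C
summary(fit)$r.squared  # one measure of fit quality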

The second model can also be fit using nls() in R with the usual caveats of having to provide starting values etc. The statistical issues in optimization are not necessarily the same as the numerical issues -- you cannot just optimize any functional form no matter which language you choose.
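One caveat worth sketching: the rational model as written is only determined up to a common scaling of numerator and denominator, so it usually helps to pin one denominator coefficient (here D = 1). The data and starting values below are made up for illustration:

x <- seq(1, 10, length.out = 50)   # avoid x = 0, where the denominator vanishes
y <- (1 + 2*x + 0.5*x^2) / (x + 0.1*x^2) + rnorm(50, sd = 0.01)
fit2 <- nls(y ~ (A + B*x + C*x^2) / (x + E*x^2),
            data = data.frame(x, y),
            start = list(A = 1, B = 1, C = 1, E = 0.2))
summary(fit2)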

Check out GNU Octave - between its polyfit() and the nonlinear constraints solver it ought to be possible to construct something suitable for your problem.

In R, this is quite easy.

The built-in function is optim(). It takes as arguments a starting vector of candidate parameters and an objective function to minimize. You have to build your own error function, but that's really simple.

Then you call it like out <- optim(1, err_fn)

where err_fn is

err_fn <- function(A) {
    # Sum of squared residuals between the data and the model y = A*x;
    # assumes vectors eckses (x values) and data (y values) of length data_length
    sse <- 0
    for (i in 1:data_length) {
        x <- eckses[i]
        y <- data[i]
        model_y <- A * x
        sse <- sse + (y - model_y)^2
    }
    return(sse)
}

This just assumes you have vectors of x and y values in eckses and data. Change the model_y line as you see fit, or even add more parameters, as sketched below.
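For example, to fit more than one parameter, pass optim() a vector and index into it; here is a sketch for a hypothetical model y = A*exp(B*x), reusing the same eckses and data vectors (the model and starting values are illustrative):

err_fn2 <- function(p) {
    # p[1] is A, p[2] is B in the model y = A * exp(B * x)
    model_y <- p[1] * exp(p[2] * eckses)
    sum((data - model_y)^2)
}
out2 <- optim(c(1, 0.1), err_fn2)
out2$par    # fitted A and B
out2$value  # final sum of squared errors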

It works fine on nonlinear models; I use it for four-dimensional e^x curves and it is very fast. The output includes the error value at the end of the fitting, which is a measure of how well the model fits: in my err_fn, the sum of squared differences.

EDIT: If you NEED to take in the model as a string, you can have your user interface construct this whole model fitting process as an R script and load it in to run. R can take text from STDIN or from a file, so it shouldn't be too hard to craft this function's string equivalent, and have it run optim automatically.
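As an alternative to generating a whole script, note that R can parse such a string into a formula directly; a minimal sketch (the string, data, and starting values are illustrative, and nls() treats anything listed in start as a parameter):

model_string <- "A + B*x + C*x^2"     # would come from the user interface
form <- as.formula(paste("y ~", model_string))
fit <- nls(form, data = data.frame(x, y),
           start = list(A = 0, B = 1, C = 1))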

You're probably not going to find a single routine with the flexibility implied in your examples (polynomials and rational functions using the same routine), let alone one that will parse a string to figure out what sort of equation to fit.

A least-squares polynomial fitter would be appropriate for your first example. (It's up to you what degree of polynomial to use: quadratic, cubic, quartic, etc.) For a rational function like your second example, you might have to "roll your own" if you can't find a suitable library. Also, keep in mind that a sufficiently high-degree polynomial can be used to approximate your "real" function, as long as you don't need to extrapolate beyond the limits of the data set you're fitting to.
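As a sketch of that idea in R, here is a polynomial stand-in for sin() over a fixed range, the kind of cheap approximation the asker wants for real-time use (the degree and range are arbitrary choices):

x <- seq(0, pi/2, length.out = 200)
pfit <- lm(sin(x) ~ poly(x, 7, raw = TRUE))   # degree-7 least-squares fit
max(abs(residuals(pfit)))                     # worst-case error on the fitted range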

As others have noted, there are other, more generalized parameter estimation algorithms which might also prove useful. But those algorithms aren't quite "plug and play": they usually require you to write some helper routines, and supply a list of initial values for the model parameters. It's possible for these sorts of algorithms to diverge, or get stuck in a local minimum or maximum for an unlucky choice of initial parameter estimates.

If you have constraints on your coefficients, you know there is a specific type of function you want to fit to your data, and that function is messy enough that standard regression or other curve-fitting methods won't work, have you considered genetic algorithms?

They're not my first choice, but if you are trying to find the coefficients of the second function you mentioned, then perhaps GAs would work, especially if you are using non-standard metrics to evaluate best fit. For example, if you wanted to find the coefficients of "(A+B*x+C*x^2)/(D*x+E*x^2)" such that the sum of squared differences between your function and your data is minimal, subject to some constraint on the arc length of the resulting function, then a stochastic algorithm might be a good way to approach this; a sketch follows.
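As a sketch of that idea, assuming the CRAN package GA (one of several GA packages for R; argument names follow its recent versions), with x and y data in scope and D pinned to 1 as discussed earlier. The arc-length constraint would enter the fitness function as a penalty term, omitted here:

library(GA)
sse <- function(p) {
    A <- p[1]; B <- p[2]; C <- p[3]; E <- p[4]
    yhat <- (A + B*x + C*x^2) / (x + E*x^2)   # D fixed at 1
    -sum((y - yhat)^2)                        # ga() maximizes, so negate the SSE
}
res <- ga(type = "real-valued", fitness = sse,
          lower = rep(-10, 4), upper = rep(10, 4))
res@solution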

Some caveats: 1) stochastic algorithms won't guarantee the best solution, but they will often come very close. 2) You have to be careful about the stability of the algorithm.

On a longer note, if you are at the stage where you want to find a function from some space of functions that best fits your data (e.g., you are not going to impose, say, the second model on your data), then genetic programming techniques may also help.
