Question

I have a simple x,y data set to fit, at least at first glance. The issue is that scipy.optimize.curve_fit gives back a very large value for one of the parameters fitted and I don't know if this is mathematically correct or if there's something wrong with how I'm fitting the data.

The figure below shows both the data points and the best fit obtained in blue. The curve used (func in the MWE below) has four parameters a, b, c, d being fitted:

  • a gives approximately the x value where the curve attains its half-maximum.
  • b represents the x value where the curve stabilizes; the function's value there is given by the d parameter, i.e.: func(b) = d.
  • c is related to the maximum value of the curve at the origin: func(0) = c*constant + d
  • d is where the curve stabilizes (black line in the figure).

The b parameter is the one I'm having issues with (see end of question); it's also the one I'm most interested in assigning a reasonable value to.

(Figure: data points with the best fit in blue and the horizontal line at d in black.)

The MWE shows the function being fitted and the results:

import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

# Function to be fitted.
def func(x, a, b, c, d):
    return c * (1 / np.sqrt(1 + (np.asarray(x) / a) ** 2) -
        1 / np.sqrt(1 + (b / a) ** 2)) ** 2 + d

# Define x,y data.    
x_list = [12.5, 37.5, 62.5, 87.5, 112.5, 137.5, 162.5, 187.5, 212.5, 237.5,
    262.5, 287.5, 312.5, 337.5, 362.5, 387.5, 412.5, 437.5, 462.5, 487.5,
    512.5]
y_list = [0.008, 0.0048, 0.0032, 0.00327, 0.0023, 0.00212, 0.00187,
    0.00086, 0.00070, 0.00100, 0.00056, 0.00076, 0.00052, 0.00077, 0.00067,
    0.00048, 0.00078, 0.00067, 0.00069, 0.00061, 0.00047]

# Initial guess for the 4 parameters.
guess = (50., 200., 80. / 10000., 6. / 10000.)

# Fit curve to x,y data.
f_prof, f_err = curve_fit(func, x_list, y_list, guess)

# Values for the a,b,c,d fitted parameters.
print(f_prof)

# Errors (standard deviations) for the fitted parameters.
print(np.sqrt(f_err[0][0]), np.sqrt(f_err[1][1]), np.sqrt(f_err[2][2]),
    np.sqrt(f_err[3][3]))

# Generate plot.
plt.scatter(x_list, y_list)
plt.plot(x_list, func(x_list, f_prof[0], f_prof[1], f_prof[2], f_prof[3]))
plt.hlines(y=f_prof[3], xmin=0., xmax=max(x_list))
plt.show()

The results I get are:

# a, b, c, d
 52.74, 2.52e+09, 7.46e-03, 5.69e-04

# errors
11.52, 1.53e+16, 0.0028, 0.00042

The b parameter has a huge value and also a huge error. By looking at the data plotted in the figure, one would estimate by eye that the value for b (ie: the x value where the data set stabilizes) should be around x=300. Why am I getting such a large value for b and its error?


Solution

I don't know if this is intentional or a mistake, but it looks to me like 'b' will be strongly correlated with 'a' and 'd' and has no "interaction" with the independent variable 'x'. If b/a is large enough, you can approximate 1 / np.sqrt(1 + (b / a) ** 2) as a/b, so that your function becomes roughly c * function_of(x, a) - a/b + d.

Your 'a' and 'x' values are large enough that this becomes very nearly c*a/x - a/b + d.
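
A quick numerical check of that approximation, plugging in the fitted a and b from the question's output (a minimal sketch, not part of the original MWE):

import numpy as np

# Fitted values taken from the question's printed results.
a, b = 52.74, 2.52e9

exact = 1. / np.sqrt(1. + (b / a) ** 2)
approx = a / b
print(exact, approx)
# Both are ~2.1e-08: the b-dependent term is numerically negligible next
# to d, so the fit barely changes as b grows, which is why its error
# explodes.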

As pointed out by behzad.nouri, curve_fit can be slightly unstable compared to other minimizers, and it always minimizes least squares. But it does return the full covariance matrix, including the correlations between variables (the off-diagonal elements of your f_err). Use these!
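
For instance, a minimal sketch of turning f_err from the MWE above into a correlation matrix:

import numpy as np

# f_err is the covariance matrix returned by curve_fit in the question's
# MWE. Dividing by the outer product of the standard deviations gives the
# correlation matrix; off-diagonal entries near +/-1 flag parameters that
# the data cannot constrain independently.
sigma = np.sqrt(np.diag(f_err))
corr = f_err / np.outer(sigma, sigma)
print(corr)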

If you're sure 'b' has a value around 300, or are interested in easily switching between fmin and the Levenberg-Marquardt algorithm, you might find the lmfit package (http://lmfit.github.io/lmfit-py/) useful. It allows you to put bounds on parameters, switch easily between fitting algorithms, and do a more brute-force exploration of the confidence intervals for the parameters.
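
A minimal sketch of what that could look like with lmfit's Model interface; the bounds on b below are an assumption based on the by-eye estimate of ~300, and func, x_list, y_list are taken from the question's MWE:

import numpy as np
from lmfit import Model

# Wrap the question's func in an lmfit Model; the first argument (x) is
# treated as the independent variable.
model = Model(func)
params = model.make_params(a=50., b=200., c=80. / 10000., d=6. / 10000.)
# Assumed bounds, keeping b near the by-eye estimate of ~300.
params['b'].set(min=50., max=600.)

result = model.fit(np.asarray(y_list), params, x=np.asarray(x_list))
print(result.fit_report())  # values, uncertainties and correlations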

OTHER TIPS

You can add a penalty on the norm of the parameters and use fmin:

import numpy as np
from numpy.linalg import norm
from scipy.optimize import fmin

def func(x, a, b, c, d):
    return c * (1 / np.sqrt(1 + (x / a) ** 2) -
        1 / np.sqrt(1 + (b / a) ** 2)) ** 2 + d

def errfn(params, xs, ys, lm, ord=1):
    '''
    lm: penalty multiplier
    ord: order of the norm used for the penalty
    '''
    a, b, c, d = params
    err = func(xs, a, b, c, d) - ys
    return norm(err) + lm * norm(params, ord)

# xs, ys and guess are the data and initial guess from the question.
xs, ys = np.asarray(x_list), np.asarray(y_list)
params = fmin(errfn, guess, args=(xs, ys, 1e-6, 2))

Above I am using a small penalty of 1e-6, and the fit results are:

[6.257e+01   3.956e+02   9.926e-03   7.550e-04]

with a decent fit:

(Figure: data with the penalized fit overlaid.)

Edit: playing with the penalty multiplier and the norm order, it gives a very good fit at

params = [  1.479e+01  -3.344e+00  -8.781e-03   8.347e-03]

(Figure: fit obtained with the tuned penalty and norm order.)

From a quick look, it seems that a large b will eliminate the second term of func():

When b/a goes to infinity, 1 / np.sqrt(1 + (b / a) ** 2) goes to zero.

This suggests to me that this part of the function is not needed in the model and is doing more harm than good.

Just set func to be:

c * (1 / np.sqrt(1 + (np.asarray(x) / a) ** 2)) ** 2 + d
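
Since the squared term then reduces to c / (1 + (x/a)**2) + d, here is a minimal sketch of refitting with that three-parameter model (func_simple is a hypothetical name; x_list, y_list and the guesses come from the question's MWE):

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical simplified model: the constant second term dropped and the
# square carried out explicitly; only a, c, d remain.
def func_simple(x, a, c, d):
    return c / (1. + (np.asarray(x) / a) ** 2) + d

guess3 = (50., 80. / 10000., 6. / 10000.)  # a, c, d from the question's guess
popt, pcov = curve_fit(func_simple, x_list, y_list, guess3)
print(popt)                    # fitted a, c, d
print(np.sqrt(np.diag(pcov)))  # their standard deviations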
Licensed under: CC-BY-SA with attribution