Ok, I think this is what happens here: The underlying FORTRAN
routine LMDIF
presents the user-defined function with memory in *fvec
, where the result is to be stored. This pointer may point to scratch memory, since LMDIF
needs to cache the results of several function evaluations to estimate the Jacobian.
Since the user-defined function is called from Python
, the memory in *fvec
cannot be used directly, so the wrapper raw_multipack_lm_function()
works by evaluating the Python
function first and then coping the result to *fvec
. Before entering LMDIF
, the user-defined function is called once to find out the shape of the output array.
The problem arises because the memory from the first function evaluation is then passed to LMDIF
as the original *fvec
, as if it was new disposable memory. LMDIF
goes on to use it to store the first function evaluation in and then calls the user function again with a different *fvec
. But in the example with dy()
, the user function overwrites the memory from the previous function call before the result is copied to the memory where LMDIF
wants it. This fools LMDIF
into thinking that the result never changes, hence exiting with the case that the fit parameters have no impact on the quality of the fit.
I consider this a bug in minpack_lmdif()
from scipy/optimize/__minpack.h
, since it assumes that the user-defined function always returns new disposable memory, which is not the case with dy()
(which seems to be a perfectly legitimate way to code the fit function). The following git diff
illustrates an easy fix:
diff --git a/scipy/optimize/__minpack.h b/scipy/optimize/__minpack.h
index 2c0ea33..465724b 100644
--- a/scipy/optimize/__minpack.h
+++ b/scipy/optimize/__minpack.h
@@ -483,7 +483,7 @@ static PyObject *minpack_lmdif(PyObject *dummy, PyObject *args) {
qtf = (double *) ap_qtf->data;
fjac = (double *) ap_fjac->data;
ldfjac = dims[1];
- wa = (double *)malloc((3*n + m)* sizeof(double));
+ wa = (double *)malloc((3*n + 2*m)* sizeof(double));
if (wa == NULL) {
PyErr_NoMemory();
goto fail;
@@ -492,12 +492,15 @@ static PyObject *minpack_lmdif(PyObject *dummy, PyObject *args) {
/* Call the underlying FORTRAN routines. */
n_int = n; /* to provide int*-pointed storage for int argument of LMDIF */
- LMDIF(raw_multipack_lm_function, &m, &n_int, x, fvec, &ftol, &xtol, >ol, &maxfev, &epsfcn, diag,
-
+ LMDIF(raw_multipack_lm_function, &m, &n_int, x, wa+3*n+m, &ftol, &xtol, >ol, &maxfev, &epsfcn, d
+
RESTORE_FUNC();
if (info < 0) goto fail; /* Python error */
+ /* Copy final residuals back to initial array */
+ memcpy(fvec, wa+3*n+m, m*sizeof(double));
+
free(wa);
Py_DECREF(extra_args);
Py_DECREF(ap_diag);
So we allocate m
more elements of scratch space for LMDIF
and use the additional memory as the initial *fvec
. This prevents the memory collision when the user function is called. To return the correct final residual, an additional memcpy()
is needed to store the final result in the original array.
As in the original example with dy(), this allows one to code the fit function to be free from memory allocations. Since the whole inner loop in LMDIF
is then free of memory allocations, an improved performance can be expected.
Update
Here are some timing results. Obviously, the test problem is very small and fast converging, so it might not be representative for real-world applications. This is with the patched version of scipy.optimize.leastsq
:
In [1]: def dy0(p, x, y): return p[0]*np.exp(-p[1]*x) - y
In [2]: %timeit p = opt.leastsq(dy2, p0, args=(x, buf, y))
1000 loops, best of 3: 399 us per loop
In [3]: %timeit p = opt.leastsq(dy, p0, args=(x, buf, y))
1000 loops, best of 3: 363 us per loop
In [4]: %timeit p = opt.leastsq(dy0, p0, args=(x, y))
1000 loops, best of 3: 341 us per loop
So nothing can be gained by writing to preallocated memory: The straightforward implementation of dy0()
is the fastest. But what if we write a more efficient wrapper for LMDIF
that takes better advantage of the preallocated memory? Here is one I wrote:
In [5]: %timeit p = mp.leastsq(dy, (p0.copy(), x, buf, y))
1000 loops, best of 3: 279 us per loop
That's something. mp.leastsq
still evaluates a general Python
function, with the restriction that the 1st argument will be overwritten with the result and the 3rd argument is the buffer memory. But let's see what happens if we code dy()
in C
:
In [6]: %timeit p = opt.leastsq(fitfun.e2_diff, p0, args=(x, buf, y))
10000 loops, best of 3: 48.2 us per loop
Ouch! So much for numpy
's performance of perfectly vectorized code (at least for short arrays). Let's use the improved wrapper:
In [7]: %timeit p = mp.leastsq(fitfun.e2_diff, (p0.copy(), x, buf, y))
100000 loops, best of 3: 6.94 us per loop
The additional acceleration between opt.leastsq
and mp.leastsq
comes from getting rid of tuple-building code and memcpy
s. Finally, the raw performance of LMDIF
without calling back into Python
:
In [8]: %timeit p = fitfun.e2_fit(p0.copy(), x, buf, y)
100000 loops, best of 3: 6.13 us per loop
Not much different. So calling back into Python doesn't appear to cost much, but don't let numpy
do the computation for you! Further speedups for many subsequent fits (my application) can come from reusing the scratch memory wa
for LMDIF
.
Most importantly, all computations return the correct result now!