Fit function with preallocated buffer memory in scipy.optimize.leastsq

Question

Ok, I think this is what happens here: The underlying FORTRAN routine LMDIF presents the user-defined function with memory in *fvec, where the result is to be stored. This pointer may point to scratch memory, since LMDIF needs to cache the results of several function evaluations to estimate the Jacobian.

Since the user-defined function is called from Python, the memory in *fvec cannot be used directly, so the wrapper raw_multipack_lm_function() works by evaluating the Python function first and then coping the result to *fvec. Before entering LMDIF, the user-defined function is called once to find out the shape of the output array.

The problem arises because the memory from the first function evaluation is then passed to LMDIF as the original *fvec, as if it was new disposable memory. LMDIF goes on to use it to store the first function evaluation in and then calls the user function again with a different *fvec. But in the example with dy(), the user function overwrites the memory from the previous function call before the result is copied to the memory where LMDIF wants it. This fools LMDIF into thinking that the result never changes, hence exiting with the case that the fit parameters have no impact on the quality of the fit.

I consider this a bug in minpack_lmdif() from scipy/optimize/__minpack.h, since it assumes that the user-defined function always returns new disposable memory, which is not the case with dy() (which seems to be a perfectly legitimate way to code the fit function). The following git diff illustrates an easy fix:

diff --git a/scipy/optimize/__minpack.h b/scipy/optimize/__minpack.h
index 2c0ea33..465724b 100644
--- a/scipy/optimize/__minpack.h
+++ b/scipy/optimize/__minpack.h
@@ -483,7 +483,7 @@ static PyObject *minpack_lmdif(PyObject *dummy, PyObject *args) {
   qtf = (double *) ap_qtf->data;
   fjac = (double *) ap_fjac->data;
   ldfjac = dims[1];
-  wa = (double *)malloc((3*n + m)* sizeof(double));
+  wa = (double *)malloc((3*n + 2*m)* sizeof(double));
   if (wa == NULL) {
     PyErr_NoMemory();
     goto fail;
@@ -492,12 +492,15 @@ static PyObject *minpack_lmdif(PyObject *dummy, PyObject *args) {

   /* Call the underlying FORTRAN routines. */
   n_int = n; /* to provide int*-pointed storage for int argument of LMDIF */
-  LMDIF(raw_multipack_lm_function, &m, &n_int, x, fvec, &ftol, &xtol, &gtol, &maxfev, &epsfcn, diag,
-    
+  LMDIF(raw_multipack_lm_function, &m, &n_int, x, wa+3*n+m, &ftol, &xtol, &gtol, &maxfev, &epsfcn, d
+
   RESTORE_FUNC();

   if (info < 0) goto fail;           /* Python error */

+  /* Copy final residuals back to initial array */
+  memcpy(fvec, wa+3*n+m, m*sizeof(double));
+
   free(wa);
   Py_DECREF(extra_args); 
   Py_DECREF(ap_diag);

So we allocate m more elements of scratch space for LMDIF and use the additional memory as the initial *fvec. This prevents the memory collision when the user function is called. To return the correct final residual, an additional memcpy() is needed to store the final result in the original array.

As in the original example with dy(), this allows one to code the fit function to be free from memory allocations. Since the whole inner loop in LMDIF is then free of memory allocations, an improved performance can be expected.

Update

Here are some timing results. Obviously, the test problem is very small and fast converging, so it might not be representative for real-world applications. This is with the patched version of scipy.optimize.leastsq:

In [1]: def dy0(p, x, y): return p[0]*np.exp(-p[1]*x) - y

In [2]: %timeit p = opt.leastsq(dy2, p0, args=(x, buf, y))
1000 loops, best of 3: 399 us per loop

In [3]: %timeit p = opt.leastsq(dy, p0, args=(x, buf, y))
1000 loops, best of 3: 363 us per loop

In [4]: %timeit p = opt.leastsq(dy0, p0, args=(x, y))
1000 loops, best of 3: 341 us per loop

So nothing can be gained by writing to preallocated memory: The straightforward implementation of dy0() is the fastest. But what if we write a more efficient wrapper for LMDIF that takes better advantage of the preallocated memory? Here is one I wrote:

In [5]: %timeit p = mp.leastsq(dy, (p0.copy(), x, buf, y))
1000 loops, best of 3: 279 us per loop

That's something. mp.leastsq still evaluates a general Python function, with the restriction that the 1st argument will be overwritten with the result and the 3rd argument is the buffer memory. But let's see what happens if we code dy() in C:

In [6]: %timeit p = opt.leastsq(fitfun.e2_diff, p0, args=(x, buf, y))
10000 loops, best of 3: 48.2 us per loop

Ouch! So much for numpy's performance of perfectly vectorized code (at least for short arrays). Let's use the improved wrapper:

In [7]: %timeit p = mp.leastsq(fitfun.e2_diff, (p0.copy(), x, buf, y))
100000 loops, best of 3: 6.94 us per loop

The additional acceleration between opt.leastsq and mp.leastsq comes from getting rid of tuple-building code and memcpys. Finally, the raw performance of LMDIF without calling back into Python:

In [8]: %timeit p = fitfun.e2_fit(p0.copy(), x, buf, y)
100000 loops, best of 3: 6.13 us per loop

Not much different. So calling back into Python doesn't appear to cost much, but don't let numpy do the computation for you! Further speedups for many subsequent fits (my application) can come from reusing the scratch memory wa for LMDIF.

Most importantly, all computations return the correct result now!