سؤال

I am trying to use scipy.optimize.leastsq with a fit function that uses preallocated memory to store the residuals. I have millions of nonlinear fits and time is critical. I had the fit function coded in C, and I noticed that scipy.optimize.leastsq didn't work correctly when the fit function simply returned a reference to the buffer memory that was obtained as an input. I thought I had screwed up with the Py_INCREFs when I noticed that the problem can be reproduced in pure Python. Here is some mock-up code that illustrates the issue (the actual problem has different fit functions and is much more complicated):

import numpy as np
import scipy.optimize as opt

# the mock-up data to be fitted
x = np.arange(5.)
noise = np.array([0.0, -0.02, 0.01, -0.03, 0.01])
y = np.exp(-x) + noise

# preallocate the buffer memory
buf = np.empty_like(x)

# fit function writing residuals to preallocated memory and returning a reference 
def dy(p, x, buf, y):
    buf[:] = p[0]*np.exp(-p[1]*x) - y
    return buf

# starting values (true values are [1., 1.])
p0 = np.array([1.2, 0.8])

# leastsq with dy(): DOESN'T WORK CORRECTLY
opt.leastsq(dy, p0, args=(x, buf, y))
# -> (array([ 1.2,  0.8]), 4)

To make it work correctly, I have to wrap the fit function into one that makes a copy:

# wrapper that calls dy() and returns a copy of the array
def dy2(*args): 
    return dy(*args).copy()

# leastsq with dy2(): WORKS CORRECTLY
opt.leastsq(dy2, p0, args=(x, buf, y))
# -> (array([ 0.99917134,  1.03603201]), 1)

...but making a copy obviously defeats using buffer memory in the first place! As a side note, opt.fmin also works with the buffer memory like so (but is in practice much too slow for my application):

def sum2(p, x, buf, y):
    dy(p, x, buf, y)
    return buf.dot(buf)

opt.fmin(sum2, p0, args=(x, buf, y))
# Optimization terminated successfully.
#     Current function value: 0.001200
#     Iterations: 32
#     Function evaluations: 63
# -> array([ 0.99915812,  1.03600273])

Any ideas why scipy.optimize.leastsq works correctly with dy2() and not with dy() in the above example?

هل كانت مفيدة؟

المحلول

Ok, I think this is what happens here: The underlying FORTRAN routine LMDIF presents the user-defined function with memory in *fvec, where the result is to be stored. This pointer may point to scratch memory, since LMDIF needs to cache the results of several function evaluations to estimate the Jacobian.

Since the user-defined function is called from Python, the memory in *fvec cannot be used directly, so the wrapper raw_multipack_lm_function() works by evaluating the Python function first and then coping the result to *fvec. Before entering LMDIF, the user-defined function is called once to find out the shape of the output array.

The problem arises because the memory from the first function evaluation is then passed to LMDIF as the original *fvec, as if it was new disposable memory. LMDIF goes on to use it to store the first function evaluation in and then calls the user function again with a different *fvec. But in the example with dy(), the user function overwrites the memory from the previous function call before the result is copied to the memory where LMDIF wants it. This fools LMDIF into thinking that the result never changes, hence exiting with the case that the fit parameters have no impact on the quality of the fit.

I consider this a bug in minpack_lmdif() from scipy/optimize/__minpack.h, since it assumes that the user-defined function always returns new disposable memory, which is not the case with dy() (which seems to be a perfectly legitimate way to code the fit function). The following git diff illustrates an easy fix:

diff --git a/scipy/optimize/__minpack.h b/scipy/optimize/__minpack.h
index 2c0ea33..465724b 100644
--- a/scipy/optimize/__minpack.h
+++ b/scipy/optimize/__minpack.h
@@ -483,7 +483,7 @@ static PyObject *minpack_lmdif(PyObject *dummy, PyObject *args) {
   qtf = (double *) ap_qtf->data;
   fjac = (double *) ap_fjac->data;
   ldfjac = dims[1];
-  wa = (double *)malloc((3*n + m)* sizeof(double));
+  wa = (double *)malloc((3*n + 2*m)* sizeof(double));
   if (wa == NULL) {
     PyErr_NoMemory();
     goto fail;
@@ -492,12 +492,15 @@ static PyObject *minpack_lmdif(PyObject *dummy, PyObject *args) {

   /* Call the underlying FORTRAN routines. */
   n_int = n; /* to provide int*-pointed storage for int argument of LMDIF */
-  LMDIF(raw_multipack_lm_function, &m, &n_int, x, fvec, &ftol, &xtol, &gtol, &maxfev, &epsfcn, diag,
-    
+  LMDIF(raw_multipack_lm_function, &m, &n_int, x, wa+3*n+m, &ftol, &xtol, &gtol, &maxfev, &epsfcn, d
+
   RESTORE_FUNC();

   if (info < 0) goto fail;           /* Python error */

+  /* Copy final residuals back to initial array */
+  memcpy(fvec, wa+3*n+m, m*sizeof(double));
+
   free(wa);
   Py_DECREF(extra_args); 
   Py_DECREF(ap_diag);

So we allocate m more elements of scratch space for LMDIF and use the additional memory as the initial *fvec. This prevents the memory collision when the user function is called. To return the correct final residual, an additional memcpy() is needed to store the final result in the original array.

As in the original example with dy(), this allows one to code the fit function to be free from memory allocations. Since the whole inner loop in LMDIF is then free of memory allocations, an improved performance can be expected.

Update

Here are some timing results. Obviously, the test problem is very small and fast converging, so it might not be representative for real-world applications. This is with the patched version of scipy.optimize.leastsq:

In [1]: def dy0(p, x, y): return p[0]*np.exp(-p[1]*x) - y

In [2]: %timeit p = opt.leastsq(dy2, p0, args=(x, buf, y))
1000 loops, best of 3: 399 us per loop

In [3]: %timeit p = opt.leastsq(dy, p0, args=(x, buf, y))
1000 loops, best of 3: 363 us per loop

In [4]: %timeit p = opt.leastsq(dy0, p0, args=(x, y))
1000 loops, best of 3: 341 us per loop

So nothing can be gained by writing to preallocated memory: The straightforward implementation of dy0() is the fastest. But what if we write a more efficient wrapper for LMDIF that takes better advantage of the preallocated memory? Here is one I wrote:

In [5]: %timeit p = mp.leastsq(dy, (p0.copy(), x, buf, y))
1000 loops, best of 3: 279 us per loop

That's something. mp.leastsq still evaluates a general Python function, with the restriction that the 1st argument will be overwritten with the result and the 3rd argument is the buffer memory. But let's see what happens if we code dy() in C:

In [6]: %timeit p = opt.leastsq(fitfun.e2_diff, p0, args=(x, buf, y))
10000 loops, best of 3: 48.2 us per loop

Ouch! So much for numpy's performance of perfectly vectorized code (at least for short arrays). Let's use the improved wrapper:

In [7]: %timeit p = mp.leastsq(fitfun.e2_diff, (p0.copy(), x, buf, y))
100000 loops, best of 3: 6.94 us per loop

The additional acceleration between opt.leastsq and mp.leastsq comes from getting rid of tuple-building code and memcpys. Finally, the raw performance of LMDIF without calling back into Python:

In [8]: %timeit p = fitfun.e2_fit(p0.copy(), x, buf, y)
100000 loops, best of 3: 6.13 us per loop

Not much different. So calling back into Python doesn't appear to cost much, but don't let numpy do the computation for you! Further speedups for many subsequent fits (my application) can come from reusing the scratch memory wa for LMDIF.

Most importantly, all computations return the correct result now!

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top