Fit simulated and experimental data points with Python

https://stackoverflow.com/questions/7445070

21-01-2021
|

Question

I have written some code which performs a Monte Carlo simulation and produces curves of signal intensity versus time. The shape of such a curve depends on various parameters, two of which my collaborator wants to determine by the 'real life version' of the experiment I am simulating.

We are ready to compare her experimental data with my simulated curves, but now I am stuck, as I have not yet been able to perform any fit (so far I have replaced the experimental data with simulated noisy data for testing). I have tried using scipy.optimize.leastsq, which exits with code 2 (according to the documentation this means that the fitting was successful), but it mostly simply returns the values (not exactly the same, but close) I input as an initial guess, no matter how close to or far off they were from the true values. If it does report different values, the resulting curve still differs significantly from the true one.

Another observation is that infodict['nfev'] invariably contains

The relative error between two consecutive iterates is at most 0.000000

While using my simulated noisy data I have played around with both parameters' true values being of the same order of magnitude (as I thought that the step size in use might only sensibly affect one of them otherwise), of a very different order of magnitude, and I have varied the step size (parameter epsfcn), but to no avail.

Does anyone have any idea what I might be doing wrong, or what fitting function I could use instead of leastsq? If so: thanks a lot in advance!

EDIT

As suggested by Russ I will now provide some detail on how the simulation is performed: we're looking at small molecules binding to large molecules. This happens with a probability which depends on their mutual affinity (the affinity is one of the values to be extracted from the experimental data). Once binding has occurred, we also simulate how long it takes until the complex falls apart again (the dissociation time constant is the second parameter we're interested in). There are a number of other parameters, but they only become relevant when the expected signal intensity is calculated, so they are not relevant for the actual simulation.

We start off with a given number of small molecules, the state of each of which is simulated for a number of time steps. At each time step we use the affinity value to determine whether this molecule, in case it's unbound, binds to a large molecule. In case it's bound already we use the dissociation time constant and the amount of time for which it has already been bound to determine whether it dissociates in this step.

In both cases the parameter (affinity, dissociation time constant) is used to calculate a probability, which is then compared to a random number (between 0 and 1), and on this comparison it depends if the state of the small molecule (bound/unbound) changes.

EDIT 2

There is no well-defined function which determines the shape of the resulting curve, and, even though the shape is clearly reproducible, there is an element of randomness to each individual data point. Therefore I have now attempted using optimize.fmin instead of leastsq, but it does not converge and simply exits after the maximum number of iterations have been performed.

EDIT 3

As suggested by Andrea I have uploaded a sample plot. I don't really think providing sample data would help a great deal, it's just one y-value (signal intensity) per x-value (time)...

Solution

Not exactly an answer but there is also PyMinuit to try.

http://code.google.com/p/pyminuit/

What you want to do is convert your pdf and data points to chi^2 or -ln(likelihood) or a metric of your choosing and use PyMinuit to minimize that metric. It can be configured to be very verbose so you can find out where things went wrong(if it did go wrong).

OTHER TIPS

For fitting your data with an arbitrary function you usually want the Levenberg–Marquardt algorithm, and that is what scipy.optimize.leastsq uses, so you are most likely using the correct function.

Have a look at the tutorial in section 5.4 of this page and see if it helps.

It is also possible your underlying function is difficult to fit (what is the function?), but it sounds like you may be having other issues.

Also - as great as StackOverflow is, you'll likely get much more knowledgeable responses for scipy questions by posting directly to the Scipy-User mailing list with some sample code and more detail.

If you don't know the expected functional form of the global, but can predict an expected value for the "next" point given the current state of the system you might consider using a Kalman filter (yeah, "filter" sounds goofy in a fitting context but the name is historical and can't easily be changed now).

The underlying math can look a bit scary, but this important point is that you don't have to understand it. You typically need to be able to

Define a representation space
Express your data (simulated or experimental) in the representation space.
Define a update procedure that gets a "next" representation from a given one
Extract the desired curve from a series representation as returned by the fitter

There seems to be at least one existing python package to support this (note that the interface here is different from the ones that I'm used to and I can't offer much advice on using it).

Since you only have two parameters, you should just do a grid search.

results = {}
for p0 in parameter_space_for_first_parameter:
     for p1 in parameter_space_for_second_parameter:
           results[p0,p1] = compare(p0,p1)

If you can afford the computation, compare should perform multiple runs (with different initialisations) and compute mean and standard deviations. You can try using my package jug to manage your computations (it was designed for exactly this sort of thing).

Finally, plot the results, look at the minima (there might be several). This is a "stupid" method, but it will work in many situations where other methods get stuck.

If this is too much computation, you can do this in two passes: a coarse-grained grid followed by a fine-grained one near the minima of the coarse-grained space.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow