Question

I call this code more than 1000 times, and each call takes about 1.73 s. Is there a way to optimize it?

import numpy as np

def generate():
    S0    = 0
    T     = 1.
    nt    = 100000
    lbd   = 500.
    mu    = 0
    sigma = 1.

    dt = T/nt
    St  = [S0]  * nt
    sqrtdt = np.sqrt(dt)
    dBt = np.random.normal(0, sqrtdt, nt)

    for k in xrange(1, nt):
        dSt = lbd * (mu - St[k-1]) * dt + sigma * dBt[k]
        St[k] = St[k-1] + dSt
    return St

Solution

You could squeeze a little more speed out of the for-loop itself, but the bigger gain comes from generating all the paths at the same time (assuming you have enough memory):

import numpy as np

def generate_orig(T=1., nt=100000, lbd=500., mu=0, sigma=1., S0=0):
    dt = T/nt
    St  = [S0]  * nt
    sqrtdt = np.sqrt(dt)
    dBt = np.random.normal(0, sqrtdt, nt)

    for k in xrange(1, nt):
        dSt = lbd * (mu - St[k-1]) * dt + sigma * dBt[k]
        St[k] = St[k-1] + dSt
    return St

def generate(T=1., nt=100000, lbd=500., mu=0, sigma=1., S0=0, npaths=1):
    dt = T/nt
    St  = np.full((nt, npaths), S0, dtype=np.float64)  # float dtype so an integer S0 doesn't produce an int array
    sqrtdt = np.sqrt(dt)
    dBt = np.random.normal(0, sqrtdt, size=(nt, npaths))

    for k in xrange(1, nt):
        dSt = lbd * (mu - St[k-1]) * dt + sigma * dBt[k]
        St[k] = St[k-1] + dSt
    return St

Here is a timeit benchmark comparing 100 calls of the original against a single batched call with 100 paths.

In [55]: %timeit [generate_orig() for i in xrange(100)]
1 loops, best of 3: 23.6 s per loop

In [56]: %timeit generate(npaths=100)
1 loops, best of 3: 1.97 s per loop
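
Note that the batched version returns an (nt, npaths) array rather than a single list, so the call sites change slightly: instead of calling generate() in a loop, you call it once and slice out columns. A minimal usage sketch (the loop body is just a placeholder for whatever you do with each path):

paths = generate(npaths=100)       # array of shape (nt, npaths)
for i in xrange(paths.shape[1]):
    St = paths[:, i]               # the i-th path, comparable to one generate_orig() result
    # ... process St here ...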

You might also be able to improve the performance of the for-loop by using Cython.

OTHER TIPS

I would like to offer two alternative solutions to unutbu's answer. What he writes is The Right Thing™ to do if you don't want to depend on Cython or on a JIT compiler, and if it is acceptable to generate the output St in batches.


Numba

I grabbed generate_orig() from his answer and turned the Python list St into a numpy array:

import numpy as np

def generate_orig(T=1., nt=100000, lbd=500., mu=0, sigma=1., S0=0):
    dt = T/nt
    St  = np.full(nt, fill_value=S0, dtype=np.float64)
    sqrtdt = np.sqrt(dt)
    dBt = np.random.normal(0, sqrtdt, nt)

    for k in xrange(1, nt):
        dSt = lbd * (mu - St[k-1]) * dt + sigma * dBt[k]
        St[k] = St[k-1] + dSt
    return St

Timings:

%timeit [generate_orig() for i in xrange(100)]
1 loops, best of 3: 25.4 s per loop

No improvement so far; it's essentially the same as before. However, with Numba, just by adding @autojit:

import numpy as np
from numba import autojit

@autojit
def generate_orig(T=1., nt=100000, lbd=500., mu=0, sigma=1., S0=0):
    # The rest is exactly the same as before

The timing drops:

%timeit [generate_orig(1., 100000, 500., 0, 1., 0) for i in xrange(100)]
1 loops, best of 3: 642 ms per loop

I think it's AWESOME! 40x speed-up just for adding @autojit!
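
A side note for readers on a recent Numba: @autojit has since been removed, and the usual replacement is the nopython-mode decorator @njit (or @jit(nopython=True)). A minimal sketch of the same function under that assumption (using range, since current Numba targets Python 3):

import numpy as np
from numba import njit

@njit
def generate_orig(T=1., nt=100000, lbd=500., mu=0., sigma=1., S0=0.):
    dt = T / nt
    St = np.full(nt, S0)                          # float array, since S0 defaults to a float
    dBt = np.random.normal(0.0, np.sqrt(dt), nt)  # all Brownian increments drawn up front
    for k in range(1, nt):
        dSt = lbd * (mu - St[k-1]) * dt + sigma * dBt[k]
        St[k] = St[k-1] + dSt
    return St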


Cython

Here is a Cython version with typed memoryviews:

%%cython
# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

def generate_cython(double T=1., int nt=100000, double lbd=500., double mu=0, double sigma=1., double S0=0):
    cdef int k
    cdef double dt, dSt
    cdef double[:] vSt, vdBt
    dt = T/nt
    St  = np.full(nt, fill_value=S0, dtype=np.float64)
    vSt = St
    vdBt = np.random.normal(0.0, np.sqrt(dt), nt)
    for k in xrange(1, nt):
        dSt = lbd * (mu - vSt[k-1]) * dt + sigma * vdBt[k]
        vSt[k] = vSt[k-1] + dSt
    return St
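
(To run the cell above, the Cython IPython extension needs to be loaded first, e.g. with %load_ext Cython, or %load_ext cythonmagic on older IPython versions.)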

Timing:

%timeit [generate_cython(1., 100000, 500., 0, 1., 0) for i in xrange(100)]
1 loops, best of 3: 638 ms per loop

The code is exactly as fast as the Numba version (the tiny difference is just noise). However, the code became uglier: all those type declarations make it clumsy. :( Well, not a disaster, but still.


Both solutions give roughly a 3x speed-up compared to unutbu's answer, which runs in 1.97 s on my machine. However, as I said at the beginning, his solution is the way to go if you don't want to depend on Cython or on Numba. (Both of them have drawbacks, so it is understandable if someone wants to avoid such a dependency.)

What would happen if we applied either Numba or Cython to unutbu's solution? Would that result in even faster code? No. With Numba, there is no difference. Cython makes things slightly worse. Well, perhaps a Cython guru could come up with a better solution...
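
For the record, the Numba experiment on the batched version is just the same decorator on top of unutbu's generate(); a sketch of what that looks like is below (generate_batched is only an illustrative name). One plausible explanation for the lack of speed-up is that the inner loop already works on whole rows at a time, so the heavy lifting is done by NumPy array operations that the JIT cannot improve much.

import numpy as np
from numba import autojit

@autojit
def generate_batched(T=1., nt=100000, lbd=500., mu=0, sigma=1., S0=0, npaths=1):
    dt = T/nt
    St = np.full((nt, npaths), S0, dtype=np.float64)
    dBt = np.random.normal(0, np.sqrt(dt), size=(nt, npaths))
    for k in xrange(1, nt):
        St[k] = St[k-1] + lbd * (mu - St[k-1]) * dt + sigma * dBt[k]
    return St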
