Question

I need to optimize this loop, which takes 2.5 seconds. The problem is that I call it more than 3000 times in my script. The code builds two matrices that are later used in a linear system.

Does anyone have an idea, in Python or Cython?

import time
from datetime import datetime

import pandas as pd

## df is only here for illustration and date_indicatrice changes upon function call
df     = pd.DataFrame(0, columns=range(6),
                      index=pd.date_range(start=datetime(2010,1,1),
                                          end=datetime(2020,1,1), freq="H"))
mat    = pd.DataFrame(0, index=df.index, columns=range(6))
mat_bp = pd.DataFrame(0, index=df.index, columns=range(6*2))

date_indicatrice = [(datetime(2010,1,1), datetime(2010,4,1)),
                    (datetime(2012,5,1), datetime(2019,4,1)),
                    (datetime(2013,4,1), datetime(2019,4,1)),
                    (datetime(2014,3,1), datetime(2019,4,1)),
                    (datetime(2015,1,1), datetime(2015,4,1)),
                    (datetime(2013,6,1), datetime(2018,4,1))]

timer = time.time()

for j, (d1,d2) in enumerate(date_indicatrice):
    result      = df[(mat.index>=d1)&(mat.index<=d2)]
    result2     = df[(mat.index>=d1)&(mat.index<=d2)&(mat.index.hour>=8)]
    mat.loc[result.index,j]       = 1.
    mat_bp.loc[result2.index,j*2] = 1.
    mat_bp[j*2+1] = (1 - mat_bp[j*2]) * mat[j]

print(time.time()-timer)

Solution

Here you go. I tested the following and get the same matrices in mat and mat_bp as your original code, but in 0.07 seconds versus 1.4 seconds for the original code on my machine.

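The real slowdown came from assigning through result.index and result2.index. Looking up rows by datetime label is much slower than addressing them by integer position. Because the index is sorted, I used binary searches (the bisect module) where possible to find the start and end positions of each date range.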

import bisect
import time
from datetime import datetime

import numpy as np
import pandas as pd

## df is only here for illustration and date_indicatrice changes upon function call
df     = pd.DataFrame(0, columns=range(6),
                      index=pd.date_range(start=datetime(2010,1,1),
                                          end=datetime(2020,1,1), freq="H"))
mat    = pd.DataFrame(0, index=df.index, columns=range(6))
mat_bp = pd.DataFrame(0, index=df.index, columns=range(6*2))

date_indicatrice = [(datetime(2010,1,1), datetime(2010,4,1)),
                    (datetime(2012,5,1), datetime(2019,4,1)),
                    (datetime(2013,4,1), datetime(2019,4,1)),
                    (datetime(2014,3,1), datetime(2019,4,1)),
                    (datetime(2015,1,1), datetime(2015,4,1)),
                    (datetime(2013,6,1), datetime(2018,4,1))]

timer = time.time()

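for j, (d1,d2) in enumerate(date_indicatrice):
    # Binary search for the integer positions bounding the rows inside [d1, d2]
    ind_start = bisect.bisect_left(mat.index, d1)
    ind_end = bisect.bisect_right(mat.index, d2)
    inds = np.arange(ind_start, ind_end)
    # Keep only the positions whose timestamp is at or after 08:00
    valid_inds = inds[mat.index[ind_start:ind_end].hour >= 8]
    # .iloc addresses rows by integer position; .loc would treat these as labels
    mat.iloc[ind_start:ind_end, j] = 1.
    mat_bp.iloc[valid_inds, j*2] = 1.
    mat_bp[j*2+1] = (1 - mat_bp[j*2]) * mat[j]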

print(time.time()-timer)
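
Since the loop is called thousands of times, the same binary-search idea could also be wrapped in a function that fills plain NumPy arrays and builds the DataFrames only once at the end. The following is a minimal sketch, not a benchmarked replacement: the names build_indicator_matrices and start_hour are made up for illustration, the index is assumed to be a sorted DatetimeIndex, and DatetimeIndex.searchsorted stands in for the bisect calls.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def build_indicator_matrices(index, date_ranges, start_hour=8):
    """Return (mat, mat_bp) for a sorted DatetimeIndex.

    Hypothetical helper: the function name and start_hour parameter
    are illustrative and do not appear in the original code.
    """
    n = len(date_ranges)
    mat = np.zeros((len(index), n))
    mat_bp = np.zeros((len(index), 2 * n))
    # Hour test computed once and reused for every date range.
    peak = np.asarray(index.hour >= start_hour)

    for j, (d1, d2) in enumerate(date_ranges):
        # Binary search: rows lo .. hi-1 are those with d1 <= t <= d2.
        lo = index.searchsorted(pd.Timestamp(d1), side="left")
        hi = index.searchsorted(pd.Timestamp(d2), side="right")
        mat[lo:hi, j] = 1.0
        mat_bp[lo:hi, 2 * j] = peak[lo:hi]       # in range, hour >= start_hour
        mat_bp[lo:hi, 2 * j + 1] = ~peak[lo:hi]  # in range, hour < start_hour

    return (pd.DataFrame(mat, index=index, columns=range(n)),
            pd.DataFrame(mat_bp, index=index, columns=range(2 * n)))


# Small illustrative input (assumed data, not from the question).
index = pd.date_range("2010-01-01", "2010-01-10", freq=pd.Timedelta(hours=1))
ranges = [(datetime(2010, 1, 2), datetime(2010, 1, 4)),
          (datetime(2010, 1, 3, 5), datetime(2010, 1, 8, 12))]
mat, mat_bp = build_indicator_matrices(index, ranges)
```

Working on raw arrays avoids pandas indexing overhead inside the loop, and the odd columns are written directly instead of being derived from (1 - mat_bp[j*2]) * mat[j]. Whether this is measurably faster than the version above would need to be timed on the real data.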
Licensed under: CC-BY-SA with attribution