Question

I have a dataframe that contains observations from multiple entities over time. The index is a time series and is unique, but irregular.

A section of the dataframe looks like this:

DATE    ('ACTION', 111, 1/7/2010)   ('ACTION', 222, 1/5/2010)
1/1/2010    10                          5
1/2/2010    10                          5
1/3/2010    10                          5
1/4/2010    15                          5
1/5/2010    10                          5
1/6/2010    10                          5
1/7/2010    10                          5
1/8/2010    10                          5

Each column label is a tuple from a hierarchical index: the first value is a category, the second is an ID and the third is an event date. I want the last retained observation in each column to be the day before its event date, with every value on or after the event date replaced by NaN.

The new frame would look like this:

DATE    ('ACTION', 111, 1/7/2010)   ('ACTION', 222, 1/5/2010)
1/1/2010    10                          5
1/2/2010    10                          5
1/3/2010    10                          5
1/4/2010    15                          5
1/5/2010    10                          NaN
1/6/2010    10                          NaN
1/7/2010    NaN                         NaN
1/8/2010    NaN                         NaN

The dataframe could potentially contain 100,000 columns. I think I understand how to replace the values in one column using a Boolean mask, but I do not understand how to do this efficiently over multiple columns.

The reason for needing this is to make sure observations are prior to an individual event that occurs at the event date. Any help would be highly appreciated.

Solution

This may not be especially fast either, but it is a cleaner, pandas-based approach:

df.where(df.apply(lambda x: x.index < pd.Timestamp(x.name[2])))

The apply returns a dataframe with True/False values (the < expression is evaluated for each column where x.name[2] selects the third level of that column name), and the where replaces the False values with NaN.

Full example:

In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: s = """,ACTION,ACTION
   ...: ,111,222
   ...: ,1/7/2010,1/5/2010
   ...: DATE,,
   ...: 1/1/2010,    10,                          5
   ...: 1/2/2010,    10,                          5
   ...: 1/3/2010,    10,                          5
   ...: 1/4/2010,    15,                          5
   ...: 1/5/2010,    10,                          5
   ...: 1/6/2010,    10,                          5
   ...: 1/7/2010,    10,                          5
   ...: 1/8/2010,    10,                          5"""

In [4]: df = pd.read_csv(StringIO(s), header=[0,1,2], index_col=0, parse_dates=True)

In [5]: df.where(df.apply(lambda x: x.index < pd.Timestamp(x.name[2])))
Out[5]:
              ACTION
                 111       222
            1/7/2010  1/5/2010
DATE
2010-01-01        10         5
2010-01-02        10         5
2010-01-03        10         5
2010-01-04        15         5
2010-01-05        10       NaN
2010-01-06        10       NaN
2010-01-07       NaN       NaN
2010-01-08       NaN       NaN
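
For a very wide frame, the same mask can also be built in one step with NumPy broadcasting instead of calling a lambda per column. The following is only a sketch under assumptions: the frame is rebuilt by hand rather than read from CSV, the third column level is assumed to hold month/day/year strings, and the names event_dates, mask and result are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the small example frame with a three-level column index.
index = pd.date_range("2010-01-01", periods=8, freq="D", name="DATE")
columns = pd.MultiIndex.from_tuples(
    [("ACTION", 111, "1/7/2010"), ("ACTION", 222, "1/5/2010")]
)
data = np.column_stack([[10, 10, 10, 15, 10, 10, 10, 10], [5] * 8])
df = pd.DataFrame(data, index=index, columns=columns)

# Parse the third column level once (assumed format: month/day/year).
event_dates = pd.to_datetime(columns.get_level_values(2), format="%m/%d/%Y")

# Broadcast an (n_rows, 1) array of row dates against a (1, n_cols) array of
# event dates; an entry is True where the row date is strictly before the event.
mask = index.values[:, None] < event_dates.values[None, :]

# where() keeps values where the mask is True and writes NaN elsewhere.
result = df.where(mask)
print(result)
```

Because the comparison runs on whole arrays, the work per column happens inside NumPy rather than in a Python-level function call, which is where the apply version would be expected to spend most of its time on many columns.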

OTHER TIPS

I am sure there is a better way to do this, but three lines will do the job:

In [194]:

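# Assumes `import numpy as np` and `import pandas as pd`; DATE is an ordinary column here.
dates = pd.to_datetime(df['DATE']).values[:, np.newaxis]
events = pd.to_datetime([item[-1] for item in df.columns.tolist()[1:]]).values
# True where DATE + 1 day > event date, i.e. DATE >= event date for daily data.
A = (dates + np.timedelta64(1, 'D')) > events
# Prepend an all-False column so the DATE column itself is never masked.
B = np.hstack((np.zeros((len(df), 1), dtype=bool), A))
print(df.where(~B))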

#       DATE  (ACTION, 111, 1/7/2010)  (ACTION, 222, 1/5/2010)
#0  1/1/2010                       10                        5
#1  1/2/2010                       10                        5
#2  1/3/2010                       10                        5
#3  1/4/2010                       15                        5
#4  1/5/2010                       10                      NaN
#5  1/6/2010                       10                      NaN
#6  1/7/2010                      NaN                      NaN
#7  1/8/2010                      NaN                      NaN

#[8 rows x 3 columns]

I assume your DATE column is stored as strings and that the last item in each column-name tuple is also a string. If so, you need the to_datetime conversions; otherwise you can skip some of them.

Edit: It runs quite slowly: 100 loops, best of 3: 4.55 ms per loop.
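
Timings like this depend heavily on the frame's shape, so it may be worth measuring on something closer to your own data. Below is a rough harness, not a definitive benchmark: the frame is synthetic, the sizes are arbitrary, and the function names mask_with_apply and mask_with_broadcast are made up for illustration. The apply variant wraps its result in a Series so that the row index is kept explicitly.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic frame (arbitrary sizes): 250 daily rows, 500 entity columns.
n_rows, n_cols = 250, 500
index = pd.date_range("2010-01-01", periods=n_rows, freq="D", name="DATE")
rng = np.random.default_rng(0)
event_dates = index[rng.integers(0, n_rows, size=n_cols)]
columns = pd.MultiIndex.from_arrays(
    [["ACTION"] * n_cols, np.arange(n_cols), event_dates.strftime("%m/%d/%Y")]
)
wide = pd.DataFrame(np.ones((n_rows, n_cols)), index=index, columns=columns)


def mask_with_apply(frame):
    # One Python-level call per column; the third tuple element is the event date.
    cond = frame.apply(
        lambda col: pd.Series(col.index < pd.Timestamp(col.name[2]), index=col.index)
    )
    return frame.where(cond)


def mask_with_broadcast(frame):
    # Single array comparison: (n_rows, 1) row dates against (1, n_cols) event dates.
    events = pd.to_datetime(frame.columns.get_level_values(2), format="%m/%d/%Y")
    cond = frame.index.values[:, None] < events.values[None, :]
    return frame.where(cond)


t_apply = timeit.timeit(lambda: mask_with_apply(wide), number=3)
t_broadcast = timeit.timeit(lambda: mask_with_broadcast(wide), number=3)
print(f"apply: {t_apply:.3f}s  broadcast: {t_broadcast:.3f}s")
```

Increasing n_cols toward the size of the real frame should show how each variant scales with the number of columns.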

Licensed under: CC-BY-SA with attribution