Question

I have a pandas dataframe containing timestamps like this:

     time_0            time_1            time_2            time_3  
21/03/2014 16:17  21/03/2014 15:40  21/03/2014 14:55  21/03/2014 12:50   
21/03/2014 16:29  21/03/2014 16:26  21/03/2014 16:23  21/03/2014 16:21  
04/07/2012 13:43  04/07/2012 13:37  04/07/2012 13:34  04/07/2012 13:29  
19/03/2014 01:41  18/03/2014 01:19  17/03/2014 00:50  05/03/2014 22:30   
21/01/2010 17:22  21/01/2010 17:21  21/01/2010 17:21  21/01/2010 17:21   

I would like to take the difference between each column and the following one (a time delta, if you prefer). My desired output is: time_0-time_1, time_1-time_2, time_2-time_3, etc.

When I simply compute column[i]-column[i+1], I get this kind of error:

Traceback (most recent call last):
  File "D:\TwitterFakeDetect\src\root\userinforetrieval\reformat.py", line 22, in <module>
    deltadatas[column_names[0]]=deltadatas[column_names[0]]-deltadatas[column_names[1]]
  File "C:\Users\SC5\Anaconda\lib\site-packages\pandas\core\ops.py", line 496, in wrapper
    arr = na_op(lvalues, rvalues)
  File "C:\Users\SC5\Anaconda\lib\site-packages\pandas\core\ops.py", line 449, in na_op
    result[mask] = op(x[mask], y[mask])
TypeError: unsupported operand type(s) for -: 'str' and 'str'

So I thought this did not work because I had strings instead of dates. I created a function to convert my data frame as such:

def convert_string_to_date(x):
        datetime.strptime(i, '%d/%m/%Y %H:%M')

I then applied it using pandas' built-in apply function. But the error below suggests that my values are not strings after all, so this does not really solve the problem:

Traceback (most recent call last):
  File "D:\TwitterFakeDetect\src\root\userinforetrieval\reformat.py", line 14, in <module>
    deltadatas.apply(convert_string_to_date, axis=1)
  File "C:\Users\SC5\Anaconda\lib\site-packages\pandas\core\frame.py", line 3424, in apply
    return self._apply_standard(f, axis, reduce=reduce)
  File "C:\Users\SC5\Anaconda\lib\site-packages\pandas\core\frame.py", line 3494, in _apply_standard
    results[i] = func(v)
  File "D:\TwitterFakeDetect\src\root\userinforetrieval\reformat.py", line 12, in convert_string_to_date
    datetime.strptime(i, '%d/%m/%Y %H:%M')
TypeError: ('must be string, not int', u'occurred at index 0')

To sum up, is there a way to check the type of my dataframe's content, so I can understand why subtracting my columns is not as easy as I thought?

Thanks for any help! I am really stuck on this one!


Solution

Parse the datetimes when reading the data in. The list of lists passed to parse_dates combines each date column and its time column into a single datetime column. (Note that this skips the header from above for simplicity.)

In [23]: df = read_csv(StringIO(data),sep='\s+',header=None,parse_dates=[[0,1],[2,3],[4,5],[6,7]])

In [24]: df.columns = ['date1','date2','date3','date4']

In [25]: df
Out[25]: 
                date1               date2               date3               date4
0 2014-03-21 16:17:00 2014-03-21 15:40:00 2014-03-21 14:55:00 2014-03-21 12:50:00
1 2014-03-21 16:29:00 2014-03-21 16:26:00 2014-03-21 16:23:00 2014-03-21 16:21:00
2 2012-04-07 13:43:00 2012-04-07 13:37:00 2012-04-07 13:34:00 2012-04-07 13:29:00
3 2014-03-19 01:41:00 2014-03-18 01:19:00 2014-03-17 00:50:00 2014-05-03 22:30:00
4 2010-01-21 17:22:00 2010-01-21 17:21:00 2010-01-21 17:21:00 2010-01-21 17:21:00

[5 rows x 4 columns]
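
The question also asks how to check what type the dataframe's content actually has. A minimal sketch, assuming a small hypothetical frame `raw` that holds two of the question's rows as strings (the names `raw` and `parsed` are not from the original code):

```python
import pandas as pd

# Hypothetical two-row sample in the question's dd/mm/YYYY HH:MM layout.
raw = pd.DataFrame({
    "time_0": ["21/03/2014 16:17", "04/07/2012 13:43"],
    "time_1": ["21/03/2014 15:40", "04/07/2012 13:37"],
})

# String columns show up as object (or a string dtype in newer pandas),
# not as datetime64.
print(raw.dtypes)

# Inspect the concrete Python type of the values in one column.
print(raw["time_0"].map(type).unique())

# Convert every column, stating the day-first format explicitly.
parsed = raw.apply(pd.to_datetime, format="%d/%m/%Y %H:%M")

# After parsing, each column should report a datetime64 dtype.
print(parsed.dtypes)
```

If `dtypes` does not report a datetime64 type for a column, subtracting that column from another will not produce timedeltas.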

Once the columns are datetimes, computing the timedeltas is straightforward:

In [26]: DataFrame(dict(td1 = df['date1']-df['date2'], td2 = df['date2']-df['date3'], td3=df['date3']-df['date4']))
Out[26]: 
               td1              td2                td3
0 0 days, 00:37:00 0 days, 00:45:00   0 days, 02:05:00
1 0 days, 00:03:00 0 days, 00:03:00   0 days, 00:02:00
2 0 days, 00:06:00 0 days, 00:03:00   0 days, 00:05:00
3 1 days, 00:22:00 1 days, 00:29:00 -47 days, 21:40:00
4 0 days, 00:01:00 0 days, 00:00:00   0 days, 00:00:00

[5 rows x 3 columns]
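
Note that in the parsed output above, 04/07/2012 became 2012-04-07 and 05/03/2014 became 2014-05-03, i.e. the ambiguous dates were read month-first, which is why td3 is negative in row 3. Since other rows (such as 21/03/2014) can only be day-first, that is probably not what the question intends. The following is a sketch, not the original answer's method: it assumes the data is already in a frame of strings (here a hypothetical `raw` with two of the question's rows), parses it with an explicit format, and builds the consecutive differences without naming each column pair by hand. Passing dayfirst=True to read_csv may be another option.

```python
import pandas as pd

# Rows 3 and 4 of the question's data, kept as strings (dd/mm/YYYY HH:MM).
raw = pd.DataFrame({
    "time_0": ["04/07/2012 13:43", "19/03/2014 01:41"],
    "time_1": ["04/07/2012 13:37", "18/03/2014 01:19"],
    "time_2": ["04/07/2012 13:34", "17/03/2014 00:50"],
    "time_3": ["04/07/2012 13:29", "05/03/2014 22:30"],
})

# An explicit format removes the day/month ambiguity.
parsed = raw.apply(pd.to_datetime, format="%d/%m/%Y %H:%M")

# Pair each column with the one that follows it and subtract.
cols = list(parsed.columns)
deltas = pd.DataFrame({
    f"{a}-{b}": parsed[a] - parsed[b] for a, b in zip(cols, cols[1:])
})
print(deltas)
```

With day-first parsing, the last difference in the second row comes out positive rather than at -47 days.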
Licensed under: CC-BY-SA with attribution