Question

I've got some data which has the login and logout times for a series of users.

Input:

        Login        Logout
User_1  10:25AM      6:01PM
User_2  8:58AM       5:12PM
User_3  9:23AM       1:35PM
User_3  3:10PM       4:49PM

I'd like to be able to find out the number of users that were logged in during a time period (for example, each Hour).

I'd like to be able to correlate this to other data I have in Pandas for the same periods, e.g. the number of "Foo" Events during that time.

Desired Output:

          Num Logged In   Foo Event Count
9:00AM                1                11
10:00AM               2                17
11:00AM               3                28
12:00PM               3                26
1:00PM                3                22
2:00PM                2                15
3:00PM                2                15
4:00PM                3                22
5:00PM                2                13

In the simplest case I could get the number of users logged in at exactly 10:00AM, and that would be a useful start. If I were looking at re-sampling the data to Day periods, then I'd need to be cleverer and look at something like the maximum simultaneous logins, or the average number of simultaneous logins between 9:00AM to 5:00PM.

Obviously I could write plain Python that, given the re-sampling period used in Pandas, produces the Series I need, but I'd like to know whether there is a trick within Pandas (or something I could do in Numpy) that helps with this, as I want to apply it to largish datasets (hundreds of users, thousands of days, multiple logins/logouts per user per day).


Solution

I found an approach that seems to work well:

Assuming we can transform our Login/Logout data into two DataFrames indexed by time:

Login    UserLogin
-------- ---------
8:58AM   User_2    
9:23AM   User_3    
10:25AM  User_1    
3:10PM   User_3    

Logout   UserLogout
-------- ----------
1:35PM   User_3
4:49PM   User_3
5:12PM   User_2
6:01PM   User_1
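One way to build those two frames from the raw session table (a sketch; the `raw` DataFrame and its column names are illustrative, and the real data could equally come from `read_csv`):

```python
import pandas as pd

# Raw session records from the question, one row per login/logout pair
raw = pd.DataFrame({
    'User':   ['User_1', 'User_2', 'User_3', 'User_3'],
    'Login':  ['10:25AM', '8:58AM', '9:23AM', '3:10PM'],
    'Logout': ['6:01PM', '5:12PM', '1:35PM', '4:49PM'],
})

# One frame indexed by login time, one by logout time
login = pd.DataFrame(
    {'UserLogin': raw['User'].values},
    index=pd.to_datetime(raw['Login'], format='%I:%M%p'),
).sort_index()
logout = pd.DataFrame(
    {'UserLogout': raw['User'].values},
    index=pd.to_datetime(raw['Logout'], format='%I:%M%p'),
).sort_index()
```

The clock strings are parsed onto a dummy date; with real data you would parse full timestamps instead.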

Then we can add an additional column to each table: 1 for the logins, and -1 for the logouts. The two columns need distinct names ("Login" and "Logout") so that the join below keeps both:

login['Login'] = 1
logout['Logout'] = -1

Then we can perform an outer join on the two tables, and fill the NA values the join created with 0s:

events = login.join(logout, how='outer')
events.fillna(value=0, inplace=True)

On the newly joined "events" DataFrame, we then create an "AvailabilityDelta" column that is the sum of the "Login" and "Logout" columns (the +1s and -1s we added above):

events['AvailabilityDelta'] = events.Login + events.Logout

Finally we can create an "Availability" column by performing a cumulative sum on the "AvailabilityDelta" column. This gives us the "Num Logged In" data that we were after in the original question:

events['Availability'] = events.AvailabilityDelta.cumsum()
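Putting the steps together on the sample data (a sketch; the +1/-1 columns are named "Login" and "Logout" so the outer join keeps them distinct, and the clock strings land on a dummy date):

```python
import pandas as pd

def t(strings):
    """Parse clock strings like '8:58AM' into timestamps (dummy date)."""
    return pd.to_datetime(strings, format='%I:%M%p')

# +1 at each login time, -1 at each logout time (sample data from the question)
login = pd.DataFrame({'Login': 1},
                     index=t(['8:58AM', '9:23AM', '10:25AM', '3:10PM']))
logout = pd.DataFrame({'Logout': -1},
                      index=t(['1:35PM', '4:49PM', '5:12PM', '6:01PM']))

events = login.join(logout, how='outer')   # union of both time indexes, sorted
events.fillna(value=0, inplace=True)       # no event at that time -> delta of 0
events['AvailabilityDelta'] = events.Login + events.Logout
events['Availability'] = events.AvailabilityDelta.cumsum()
```

Running this, the "Availability" column walks 1, 2, 3, 2, 3, 2, 1 and back to 0 as users come and go, matching the simultaneous-login counts expected from the sample sessions.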

At this point it is simple to add in additional information or create TimeSeries data, e.g.:

ts = events.resample('1h').mean().ffill()
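To line the hourly availability up with other hourly data, such as the Foo event counts from the question, the two series can be combined on a shared hourly index (a sketch with made-up numbers; `foo_events` stands in for whatever hourly series you already have):

```python
import pandas as pd

# A shared hourly index ('60min' keeps the frequency alias portable)
hours = pd.date_range('2013-05-07 09:00', periods=4, freq='60min')

num_logged_in = pd.Series([1, 2, 3, 3], index=hours, name='Num Logged In')
# Hypothetical Foo event counts for the same hours
foo_events = pd.Series([11, 17, 28, 26], index=hours, name='Foo Event Count')

# Align the two series column-wise on the common index
combined = pd.concat([num_logged_in, foo_events], axis=1)
```

If the two series had slightly different indexes, the same `concat` would produce NaNs where one side is missing, which makes alignment problems easy to spot.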

OTHER TIPS

Take a look at the Arrow module: it provides versatile DateTime objects with high-level methods.

Ranges & spans

Get the timespan of any unit:

>>> arrow.utcnow().span('hour')
(<Arrow [2013-05-07T05:00:00+00:00]>, <Arrow [2013-05-07T05:59:59.999999+00:00]>)

Or just get the floor and ceiling:

>>> arrow.utcnow().floor('hour')
<Arrow [2013-05-07T05:00:00+00:00]>

>>> arrow.utcnow().ceil('hour')
<Arrow [2013-05-07T05:59:59.999999+00:00]>

Your best bet would be to convert the times using something like strptime:

>>> import time
>>> t = time.strptime("5:24pm", "%I:%M%p")  # %I (12-hour clock) so %p takes effect
>>> t.tm_hour
17
>>> t.tm_min
24

(Note that `%H` is the 24-hour format code; with it, the `%p` marker is parsed but ignored, so "5:24pm" would come back as hour 5.)

That way you can get everything in the same hour, for example, like you wanted.
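For example, bucketing a list of timestamps by hour with nothing but the standard library (the input strings here are hypothetical):

```python
import time
from collections import Counter

# Hypothetical login times as strings
stamps = ["9:23AM", "10:25AM", "10:59AM", "5:24PM"]

# tm_hour is in 24-hour form, so each value identifies one hour-long bucket
per_hour = Counter(time.strptime(s, "%I:%M%p").tm_hour for s in stamps)
```

Here both 10:25AM and 10:59AM land in the hour-10 bucket, which is the "everything in the same hour" grouping the tip describes.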

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow