Question

I have a dataframe of 600,000 x/y points with date-time information, along with another field, 'status', that holds extra descriptive information.

My objective is, for each record:

  • sum the 'status' column over records that fall within a certain spatio-temporal buffer

The specific buffer is: records within the 8 hours before the record's time t and less than 100 meters away.

Currently I have the data in a pandas data frame.

I could loop through the rows and, for each record, subset the dates of interest, then calculate distances and restrict the selection further. However, that is still quite slow with so many records.

  • This takes 4.4 hours to run.

I can see that I could create a 3-dimensional k-d tree with x, y, and the date as epoch time. However, I am not certain how to restrict the distances properly when combining dates and geographic distances in a single metric.
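For reference, here is roughly the direction I was thinking with the k-d tree, an untested sketch that assumes scipy is available. It sidesteps the mixed time/distance metric by querying the spatial neighbours first with cKDTree.query_ball_point and then filtering those candidates by time; status_sum_kdtree is just a name I made up for the sketch.

import pandas as pd
from datetime import timedelta
from scipy.spatial import cKDTree

def status_sum_kdtree(df, radius=100, before=timedelta(hours=8), after=timedelta(minutes=1)):
    # build a 2-d tree on the projected coordinates only
    xy = df[['long', 'lat']].values
    tree = cKDTree(xy)
    # for every point, get the indices of all points within 'radius' meters
    neighbours = tree.query_ball_point(xy, r=radius)
    out = []
    for i, idxs in enumerate(neighbours):
        cand = df.iloc[idxs]
        # keep only neighbours inside the time window and from other users
        mask = ((cand['date'] >= df['date'].iloc[i] - before) &
                (cand['date'] <= df['date'].iloc[i] + after) &
                (cand['user'] != df['user'].iloc[i]))
        out.append(cand.loc[mask, 'status'].sum())
    return pd.DataFrame(out)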

Here is some reproducible code for you guys to test on:

Import

import numpy as np
import numpy.random as npr
import pandas as pd
from pandas import DataFrame, date_range
from datetime import datetime, timedelta

Create data

np.random.seed(111)

Function to generate test data

def CreateDataSet(Number=1):

    Output = []

    for i in range(Number):

        # Create a date range with hour frequency
        date = date_range(start='10/1/2012', end='10/31/2012', freq='H')

        # Create long lat data
        laty = npr.normal(4815862, 5000,size=len(date))
        longx = npr.normal(687993, 5000,size=len(date))

        # status of interest
        status = [0,1]

        # Make a random list of statuses
        random_status = [status[npr.randint(low=0,high=len(status))] for i in range(len(date))]

        # user pool
        user = ['sally','derik','james','bob','ryan','chris']

        # Make a random list of users 
        random_user = [user[npr.randint(low=0,high=len(user))] for i in range(len(date))]

        Output.extend(zip(random_user, random_status, date, longx, laty))

    return pd.DataFrame(Output, columns = ['user', 'status', 'date', 'long', 'lat'])

#Create data  
data = CreateDataSet(3)
len(data)
#some time deltas
before = timedelta(hours = 8)
after = timedelta(minutes = 1)

Function to speed up

def work(df):

    output = []
    #loop through the data indices
    for i in range(0, len(df)):
        l = []
        #first we filter the data by date to have a smaller set to compute distances for

        #create a mask to query all dates in the window around date i
        date_mask = (df['date'] >= df['date'].iloc[i]-before) & (df['date'] <= df['date'].iloc[i]+after)
        #create a mask to query all users who are not user i (themselves)
        user_mask = df['user']!=df['user'].iloc[i]
        #apply masks
        dists_to_check = df[date_mask & user_mask]

        #for point i, create the coordinate to calculate distances from
        a = np.array((df['long'].iloc[i], df['lat'].iloc[i]))
        #create an array of coordinates to check on the masked data
        b = np.array((dists_to_check['long'].values, dists_to_check['lat'].values))

        #for each j in the date-queried data
        for j in range(0, len(dists_to_check)):
            #compute the euclidean distance between point a and each point of b (the date-masked data)
            x = np.linalg.norm(a-np.array((b[0][j], b[1][j])))

            #if the distance is within our range of interest append the index to a list
            if x <= 100:
                l.append(j)
            else:
                pass
        try:
            #use the list of desired indices 'l' to query a final subset of the data
            subset = dists_to_check.iloc[l]
            #summarize the column of interest then append to the output list
            output.append(subset['status'].sum())
        except IndexError, e:
            output.append(0)
            #print "There were no data to add"

    return pd.DataFrame(output)

Run code and time it

start = datetime.now()
out = work(data)
print datetime.now() - start

Is there a way to do this query in a vectorized way, or should I be chasing another technique?

<3


Solution

Here is what at least somewhat solves my problem. Since the loop can operate on different parts of the data independently, parallelization makes sense here.

Using IPython.parallel...

from IPython.parallel import Client

cli = Client()
cli.ids

dview = cli[:]

with dview.sync_imports():
    import numpy as np
    import os
    from datetime import timedelta
    import pandas as pd

#We also need to add the time deltas and the output list into the function as
#local variables, as well as add the IPython.parallel decorator

@dview.parallel(block=True)
def work(df):
    before = timedelta(hours = 8)
    after = timedelta(minutes = 1)
    output = []
    # ... the rest of the body is identical to the original work() above

Final time: 1:17:54.910206, about 1/4 of the original run time.

I would still be very interested in any suggestions for small speed improvements within the body of the function, for example along the lines of the sketch below.
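One such idea (a hedged, untested sketch, not benchmarked): replace the inner j-loop with a single vectorized distance computation per record, something along these lines; work_vectorized is just an illustrative name.

import numpy as np
import pandas as pd
from datetime import timedelta

def work_vectorized(df):
    before = timedelta(hours = 8)
    after = timedelta(minutes = 1)
    output = []
    for i in range(len(df)):
        # same date and user masks as in the original work()
        date_mask = (df['date'] >= df['date'].iloc[i] - before) & (df['date'] <= df['date'].iloc[i] + after)
        user_mask = df['user'] != df['user'].iloc[i]
        cand = df[date_mask & user_mask]
        # euclidean distances from point i to every candidate at once
        dists = np.hypot(cand['long'].values - df['long'].iloc[i],
                         cand['lat'].values - df['lat'].iloc[i])
        # sum 'status' over the candidates within 100 meters
        output.append(cand['status'].values[dists <= 100].sum())
    return pd.DataFrame(output)

This drops the per-candidate np.linalg.norm calls and the Python-level inner loop, which is where most of the time appears to go.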

Licensed under: CC-BY-SA with attribution