Question

Background:

I have a pandas DataFrame with just over 200k rows of data.

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 212812 entries, 0 to 212811
    Data columns (total 10 columns):
    date         212812  non-null values
    animal_id    212812  non-null values
    lons         212812  non-null values
    lats         212812  non-null values
    depth        212812  non-null values
    prey1        212812  non-null values
    prey2        212812  non-null values
    prey3        212812  non-null values
    dist         212812  non-null values
    sog          212812  non-null values
    dtypes: float64(9), int64(1), object(1)

For each date, there are 1000 individuals with lon/lat positions.

I would like to calculate the daily change in distance for each individual. I had done this successfully for 100 individuals using pyproj.Geod.inv, but the increase in population has slowed things down massively.

Question:

Is there an efficient way of performing calculations on a pandas dataframe using an external class method like pyproj.Geod.inv?

Example routine:

    import numpy as np
    import pyproj

    g = pyproj.Geod(ellps='WGS84')  # Earth ellipsoid for distance calculations
    ids = np.unique(data['animal_id'])

    for animal in ids:
        id_idx = data['animal_id'] == animal
        dates = data['date'][id_idx].values  # plain array, so dates[i] is positional
        for i in range(len(dates) - 1):
            # Select this animal's rows for two consecutive dates
            idx1 = id_idx & (data['date'] == dates[i])
            idx2 = id_idx & (data['date'] == dates[i + 1])
            lon1 = data['lons'][idx1].values
            lat1 = data['lats'][idx1].values
            lon2 = data['lons'][idx2].values
            lat2 = data['lats'][idx2].values
            fwd_az, bck_az, dist = g.inv(lon1, lat1, lon2, lat2)
            data.loc[idx2, 'dist'] = dist
            data.loc[idx2, 'sog'] = dist / 24.  # dist / time (hours)

Solution

I came up with a solution, but I'd really appreciate suggestions for alternative approaches, or for a more efficient way of implementing this one.

I first used the pandas shift method to add shifted lon/lat columns (inspired by this SO question), so I could perform the calculations over a single row.
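To illustrate the shift step, here is a small sketch on a toy frame (the values and the `prev_any` / `prev_same` column names are made up for illustration). It suggests one thing to watch: when rows of different animals are interleaved, a plain `shift` picks up the previous row regardless of which animal it belongs to, whereas a grouped shift stays within each animal.

```python
import pandas as pd

# Toy frame with hypothetical values: two animals, two dates each
toy = pd.DataFrame({
    'date': [1, 1, 2, 2],
    'animal_id': [1, 2, 1, 2],
    'lons': [10.0, 20.0, 11.0, 21.0],
})

# Plain shift: previous row, whichever animal it belongs to
toy['prev_any'] = toy['lons'].shift(1)

# Grouped shift: previous row of the same animal (NaN on its first row)
toy['prev_same'] = toy.groupby('animal_id')['lons'].shift(1)

print(toy)
```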

Then I used the pandas apply method (as was suggested here) to implement the pyproj.Geod.inv calculation, looping through slices of the pandas DataFrame for each individual in the population.

    import numpy as np
    import pyproj

    def calc_distspd(df):
        '''Broadcast pyproj distance calculation over pandas dataframe'''

        # Define Earth ellipsoid for dist calculations
        g = pyproj.Geod(ellps='WGS84')

        def calcdist(x):
            '''Row-wise function for pyproj distance calculations'''
            return g.inv(x['lons+1'], x['lats+1'], x['lons'], x['lats'])[2]

        # Create array of zeros to initialize new columns
        fill_data = np.zeros(len(df))

        # Create new columns for calculated values
        df['dist'] = fill_data
        df['sog'] = fill_data
        df['lons+1'] = fill_data
        df['lats+1'] = fill_data

        # Get list of unique animal_ids
        animal_ids = np.unique(df['animal_id'].values)

        # Perform function broadcast for each individual
        # (assumes each individual's rows are already in date order)
        for animal_id in animal_ids:
            idx = df['animal_id'] == animal_id

            # Add shifted position columns for dist calculations,
            # shifting within this individual's rows only
            df.loc[idx, 'lons+1'] = df.loc[idx, 'lons'].shift(1)  # lon+1 = origin position
            df.loc[idx, 'lats+1'] = df.loc[idx, 'lats'].shift(1)  # lat+1 = origin position

            # Copy 1st position over shifted column nans to prevent error
            idx2 = idx & df['lons+1'].isnull()
            df.loc[idx2, 'lons+1'] = df.loc[idx2, 'lons']
            df.loc[idx2, 'lats+1'] = df.loc[idx2, 'lats']

            df.loc[idx, 'dist'] = df[idx].apply(calcdist, axis=1)
            df.loc[idx, 'sog'] = df.loc[idx, 'dist'] / 24.  # Calc hourly speed

        # Remove shifted position columns from df
        del df['lons+1']
        del df['lats+1']

        return df
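On the request for a more efficient approach, one possibility is sketched below; it is not a tested replacement. pyproj.Geod.inv accepts array inputs, so the per-row apply and the per-animal loop could in principle be replaced by a grouped shift and a single inv call. The function name `add_daily_distance` and the `hours_per_step` parameter are hypothetical, and the sketch assumes one record per animal per date and that `geod` is a `pyproj.Geod` instance.

```python
import numpy as np
import pandas as pd


def add_daily_distance(df, geod, hours_per_step=24.0):
    """Return a copy of df with 'dist' and 'sog' filled by one geod.inv call.

    Assumes columns 'animal_id', 'date', 'lons', 'lats' and that geod
    exposes inv(lon1, lat1, lon2, lat2) -> (fwd_az, back_az, dist).
    """
    # Order rows so consecutive rows of an animal are consecutive dates
    out = df.sort_values(['animal_id', 'date']).copy()

    # Previous position of the same animal; NaN on each animal's first row
    grouped = out.groupby('animal_id')
    prev_lon = grouped['lons'].shift(1).to_numpy()
    prev_lat = grouped['lats'].shift(1).to_numpy()
    has_prev = ~np.isnan(prev_lon)

    # One array call covers every (previous, current) pair
    dist = np.zeros(len(out))
    _, _, d = geod.inv(prev_lon[has_prev], prev_lat[has_prev],
                       out['lons'].to_numpy()[has_prev],
                       out['lats'].to_numpy()[has_prev])
    dist[has_prev] = d

    out['dist'] = dist                   # first row of each animal stays 0
    out['sog'] = dist / hours_per_step   # distance per hour
    return out.sort_index()              # restore the original row order
```

With pyproj installed, it would be called as `add_daily_distance(data, pyproj.Geod(ellps='WGS84'))`. Passing the Geod object in as an argument keeps the ellipsoid choice with the caller. Whether this is actually faster on the 200k-row frame would need to be timed.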
Licensed under: CC-BY-SA with attribution