Question

I'm trying to count duplicate rows in a pandas DataFrame. I read data from a CSV file that looks like this:

feature, IV, IT
early/J_result/N, True, False
early/J_result/N, True, False
early/J_result/N, True, False
excellent/J_result/N, True, True
hillsdown/N, True, False
hillsdown/N, True, False

The desired output for the example input above is:

feature, IV, IT, count
early/J_result/N, True, False, 3
excellent/J_result/N, True, True, 1
hillsdown/N, True, False, 2

The code I have now is:

import pandas as pd
def sum_up_token_counts(hdf_file):
    df = pd.read_csv(hdf_file, sep=', ')
    # number of rows per feature value
    counts = df.groupby('feature').count().feature
    assert counts.sum() == df.shape[0]  # no missing rows
    df = df.drop_duplicates()
    df.set_index('feature', inplace=True)
    df['count'] = counts
    return df

This works as expected, but takes a long time. I profiled it and it looks like almost all of the time is spent grouping and counting.

Total time: 4.43439 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    28                                           
    29         1        57567  57567.0      1.3      df = pd.read_csv(hdf_file, sep=', ')
    30         1      4368529 4368529.0     98.5      counts = df.groupby('feature').count().feature
    31         1          174    174.0      0.0      assert counts.sum() == df.shape[0]  # no missing rows
    32         1         6234   6234.0      0.1      df = df.drop_duplicates()
    33         1          501    501.0      0.0      df.set_index('feature', inplace=True)
    34         1         1377   1377.0      0.0      df['count'] = counts
    35         1            1      1.0      0.0      return df

Any ideas how this piece of code could be sped up?


Solution

Use pandas master/0.14 (coming shortly); it vastly speeds up groupby count.
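
As an aside, the groupby/count/drop_duplicates dance in the question can be collapsed into a single groupby over all three columns, since each distinct row is exactly one group. A minimal sketch, assuming the same CSV layout as in the question (the engine='python' argument is only there because of the multi-character separator):

import pandas as pd

def sum_up_token_counts(csv_file):
    df = pd.read_csv(csv_file, sep=', ', engine='python')
    # each distinct (feature, IV, IT) combination is one group;
    # size() counts rows per group, NaNs included
    counts = df.groupby(['feature', 'IV', 'IT']).size()
    out = counts.reset_index()
    out.columns = list(df.columns) + ['count']
    return out

On older versions, size() also tends to be much cheaper than count(), since it only measures group lengths instead of tallying non-null values per column.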

Here's a benchmark of master/0.14 against 0.13.1:

Setup (assuming import numpy as np, import pandas as pd, and from pandas import DataFrame):

In [1]: n = 10000

In [2]: offsets = np.random.randint(n, size=n).astype('timedelta64[ns]')

In [3]: dates = np.datetime64('now') + offsets

In [4]: dates[np.random.rand(n) > 0.5] = np.datetime64('nat')

In [5]: offsets[np.random.rand(n) > 0.5] = np.timedelta64('nat')

In [6]: value2 = np.random.randn(n)

In [7]: value2[np.random.rand(n) > 0.5] = np.nan

In [8]: obj = pd.util.testing.choice(['a', 'b'], size=n).astype(object)

In [9]: obj[np.random.randn(n) > 0.5] = np.nan

In [10]: df = DataFrame({'key1': np.random.randint(0, 500, size=n),
   ....:                 'key2': np.random.randint(0, 100, size=n),
   ....:                 'dates': dates,
   ....:                 'value2' : value2,
   ....:                 'value3' : np.random.randn(n),
   ....:                 'obj': obj,
   ....:                 'offsets': offsets})
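
Note the deliberate NaN/NaT injections in the setup: they matter because groupby count() tallies only non-null values per column, whereas size() counts every row. A tiny illustration of the difference (the frame here is just made up for the example):

import numpy as np
import pandas as pd

small = pd.DataFrame({'key': ['a', 'a', 'b'],
                      'val': [1.0, np.nan, 2.0]})
# count() skips the NaN: val is 1 for group 'a' and 1 for group 'b'
print(small.groupby('key').count())
# size() counts all rows: 2 for group 'a', 1 for group 'b'
print(small.groupby('key').size())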

v0.13.1

In [11]: %timeit df.groupby(['key1', 'key2']).count()
1 loops, best of 3: 5.41 s per loop

v0.14.0

In [11]: %timeit df.groupby(['key1', 'key2']).count()
100 loops, best of 3: 6.25 ms per loop