Question

I have some event data that is measured in time, so the data format looks like

Time(s)    Pressure    Humidity
0          10          5
0          9.9         5.1
0          10.1        5
1          10          4.9
2          11          6

Here the first column is the time elapsed since the start of the experiment, in seconds, and the other two columns are observations. A row is created when certain conditions are true; those conditions are beyond the scope of this discussion. Each set of three numbers is one row of data. Since the finest time resolution here is one second, you can have two rows with the same timestamp but different observations. Basically, these were two distinct events that the timestamp could not distinguish.

Now my problem is to roll up the data series by subsampling it, say every 10, 100, or 1000 seconds, so that I get a skimmed data series from the original, higher-granularity one. There are a few ways to decide which row to use: for instance, if you are subsampling every 10 seconds, you could have multiple rows with the timestamp of 10 seconds. You could take one of the following (a rough sketch of the first two options appears after the list):

1) the first row
2) the mean of all rows with the timestamp of 10
3) some other technique
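
To make the options concrete, here is a minimal sketch of what 1) and 2) might look like, assuming 10-second buckets formed by integer division; the bucket width and the approach are only illustrative:

import pandas as pd

df = pd.DataFrame({
    'Time(s)':  [0, 0, 0, 1, 2],
    'Pressure': [10, 9.9, 10.1, 10, 11],
    'Humidity': [5, 5.1, 5, 4.9, 6],
})

# Assign each row to a 10-second bucket: 0-9 s -> 0, 10-19 s -> 1, ...
bucket = df['Time(s)'] // 10

firstPerBucket = df.groupby(bucket).first()  # option 1: first row per bucket
meanPerBucket = df.groupby(bucket).mean()    # option 2: mean per bucket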

I am looking to do this in pandas; any ideas or pointers on how to start would be much appreciated. Thanks.

Solution

Here is a simple example that shows how to perform the requested operations with pandas.

The idea is to bin the data by time with pd.cut, group the rows by bin with groupby, and then aggregate each group.

import pandas as pd

# Create the dataframe
df = pd.DataFrame({
    'Time(s)':  [0, 0, 0, 1, 2],
    'Pressure': [10, 9.9, 10.1, 10, 11],
    'Humidity': [5, 5.1, 5, 4.9, 6],
})

# Select the time increment (bin width in seconds)
delta_t = 1

timeCol = 'Time(s)'
# Build the bin edges so they cover the whole time range
v = range(df[timeCol].min() - delta_t, df[timeCol].max() + delta_t, delta_t)
# Bin the timestamps with cut and group the rows by bin;
# observed=True keeps only the bins that actually contain rows
df_binned = df.groupby(pd.cut(df[timeCol], v), observed=True)
# Take the first row of each bin
dfFirst = df_binned.first()
# Evaluate the mean of each bin
dfMean = df_binned.mean()
# Evaluate the median of each bin
dfMedian = df_binned.median()
# Find the max of each bin
dfMax = df_binned.max()
# Find the min of each bin
dfMin = df_binned.min()

The result for dfFirst will look like this:

         Time(s)  Pressure  Humidity
Time(s)
(-1, 0]        0      10.0       5.0
(0, 1]         1      10.0       4.9
(1, 2]         2      11.0       6.0

The result for dfMean will look like this:

         Time(s)  Pressure  Humidity
Time(s)
(-1, 0]      0.0      10.0  5.033333
(0, 1]       1.0      10.0  4.900000
(1, 2]       2.0      11.0  6.000000
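
To subsample every 10 or 100 seconds instead, it is enough to change delta_t. Alternatively, the elapsed seconds can be turned into a TimedeltaIndex and rolled up with resample; here is a minimal sketch, with a 10-second window as an example:

import pandas as pd

df = pd.DataFrame({
    'Time(s)':  [0, 0, 0, 1, 2],
    'Pressure': [10, 9.9, 10.1, 10, 11],
    'Humidity': [5, 5.1, 5, 4.9, 6],
})

# Index the frame by elapsed time, then aggregate in 10-second windows
ts = df.set_index(pd.to_timedelta(df['Time(s)'], unit='s'))
every10sFirst = ts.resample('10s').first()  # first row per window
every10sMean = ts.resample('10s').mean()    # mean per window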