Question

I would like to subsample a data frame at hourly intervals from a datetime column, beginning with the time value in the first row of the data frame. My data frame runs at 10-minute intervals from the first to the last row. Example data is below:

df <- structure(list(datetime = structure(1:19, .Label = c("30/03/2011 05:09", 
"30/03/2011 05:19", "30/03/2011 05:29", "30/03/2011 05:39", "30/03/2011 05:49", 
"30/03/2011 05:59", "30/03/2011 06:09", "30/03/2011 06:19", "30/03/2011 06:29", 
"30/03/2011 06:39", "30/03/2011 06:49", "30/03/2011 06:59", "30/03/2011 07:09", 
"30/03/2011 07:19", "30/03/2011 07:29", "30/03/2011 07:39", "30/03/2011 07:49", 
"30/03/2011 07:59", "30/03/2011 08:09"), class = "factor"), a_count = c(66L, 
34L, 33L, 20L, 12L, 44L, 36L, 29L, 21L, 22L, 17L, 38L, 24L, 19L, 
60L, 54L, 27L, 36L, 45L), b_count = c(166.49, 167.54, 168.31, 
168.81, 169.24, 169.61, 169.96, 170.29, 170.63, 170.98, 171.31, 
171.62, 171.94, 172.29, 172.68, 173.15, 173.71, 174.34, 174.99
)), .Names = c("datetime", "a_count", "b_count"), class = "data.frame", row.names = c(NA, 
-19L))

df

           datetime a_count b_count
1  30/03/2011 05:09      66  166.49
2  30/03/2011 05:19      34  167.54
3  30/03/2011 05:29      33  168.31
4  30/03/2011 05:39      20  168.81
5  30/03/2011 05:49      12  169.24
6  30/03/2011 05:59      44  169.61
7  30/03/2011 06:09      36  169.96
8  30/03/2011 06:19      29  170.29
9  30/03/2011 06:29      21  170.63
10 30/03/2011 06:39      22  170.98
11 30/03/2011 06:49      17  171.31
12 30/03/2011 06:59      38  171.62
13 30/03/2011 07:09      24  171.94
14 30/03/2011 07:19      19  172.29
15 30/03/2011 07:29      60  172.68
16 30/03/2011 07:39      54  173.15
17 30/03/2011 07:49      27  173.71
18 30/03/2011 07:59      36  174.34
19 30/03/2011 08:09      45  174.99

I would like to end up with the following data frame:

        datetime   a_count b_count
30/03/2011 05:09       66  166.49
30/03/2011 06:09       36  169.96
30/03/2011 07:09       24  171.94
30/03/2011 08:09       45  174.99

Any suggestions would be greatly appreciated!

Solution

It is hard to guess what structure your data has. Is it guaranteed that there is a row at exactly the first time value plus a multiple of 60 minutes? What should happen if no such row can be found, or if there are two rows at that time? Do you need approximate matching, so that, say, 09:10 is counted as 09:09?

One idea to get you started is the following:

# I will call your dataframe `d`. 
# Transform datetime to a POSIXct object, R's datatype for timestamps
d$datetime <- as.POSIXct(as.character(d$datetime), format='%d/%m/%Y %H:%M')
# Extract the minutes
d$minute <- as.numeric(format(d$datetime, '%M'))
# And select by identical minute.
subset(d, minute == d$minute[1])
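
If rows may be missing or slightly off the hour mark, matching against explicit hourly target times is one way to make the behaviour deliberate. The following is only a sketch in base R: the example data is rebuilt with seq(), and both tz = "UTC" and the tolerance_secs threshold are assumptions made for illustration, not part of the question.

```r
# Rebuild the question's example: 19 rows at 10-minute steps.
# tz = "UTC" is an assumption, chosen to avoid daylight-saving surprises.
start <- as.POSIXct("30/03/2011 05:09", format = "%d/%m/%Y %H:%M", tz = "UTC")
d <- data.frame(
  datetime = seq(start, by = "10 min", length.out = 19),
  a_count  = c(66L, 34L, 33L, 20L, 12L, 44L, 36L, 29L, 21L, 22L,
               17L, 38L, 24L, 19L, 60L, 54L, 27L, 36L, 45L)
)

# Hourly target times, starting at the first timestamp.
targets <- seq(d$datetime[1], max(d$datetime), by = "1 hour")

# Work in seconds since the epoch so comparisons are plain numeric.
row_secs    <- as.numeric(d$datetime)
target_secs <- as.numeric(targets)

# Exact matching: keep rows whose timestamp equals a target.
# Targets with no matching row simply produce no output row.
exact <- d[row_secs %in% target_secs, ]

# Approximate matching: for each target, find the nearest row in time and
# accept it only if it lies within tolerance_secs (hypothetical threshold).
tolerance_secs <- 120
nearest <- vapply(target_secs,
                  function(t) which.min(abs(row_secs - t)),
                  integer(1))
keep <- abs(row_secs[nearest] - target_secs) <= tolerance_secs
approx_rows <- d[nearest[keep], ]
```

With the example data both selections return the same four rows; they differ only once timestamps drift or rows go missing.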

OTHER TIPS

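> # Parse to POSIXct (strptime() returns POSIXlt, which is awkward as a data frame column)
> df$datetime <- as.POSIXct(as.character(df$datetime), format = "%d/%m/%Y %H:%M")
> # Minutes elapsed since the first row; units are set explicitly rather than auto-chosen
> df$dif <- c(0, cumsum(as.numeric(diff(df$datetime), units = "mins")))
> df[df$dif %% 60 == 0, ]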

               datetime a_count b_count dif
  2011-03-30 05:09:00      66  166.49   0
  2011-03-30 06:09:00      36  169.96  60
  2011-03-30 07:09:00      24  171.94 120
  2011-03-30 08:09:00      45  174.99 180
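
If the rows really are strictly regular at 10-minute steps, as the question states, selecting by position is an even simpler option. This sketch assumes there are no gaps or duplicate rows (it silently returns the wrong rows otherwise); the data is rebuilt with seq() and tz = "UTC" is an assumption.

```r
# Rebuild the question's example: 19 rows at 10-minute steps (tz is an assumption).
start <- as.POSIXct("30/03/2011 05:09", format = "%d/%m/%Y %H:%M", tz = "UTC")
df <- data.frame(
  datetime = seq(start, by = "10 min", length.out = 19),
  a_count  = c(66L, 34L, 33L, 20L, 12L, 44L, 36L, 29L, 21L, 22L,
               17L, 38L, 24L, 19L, 60L, 54L, 27L, 36L, 45L)
)

# Six 10-minute steps make one hour, so keep rows 1, 7, 13, ...
every_hour <- df[seq(1, nrow(df), by = 6), ]
```

The time-based approaches are more robust, since they keep working when a reading is missing.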

I have the same questions as Thilo, but here's another solution.

You can also use the lubridate package to parse your times, which may be a bit more intuitive and easier to remember.

Also, you can add variables based on the hour, and then summarize however you like with plyr.

In the example below I took the sum and mean of a_count. You may need to vary this based on your purpose.

library(plyr)
library(lubridate)

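# Parse the timestamps, then add the hour and minute as separate columns
df2 <- mutate(df, dt = dmy_hm(as.character(datetime)), hour = hour(dt), minute = minute(dt))
# Aggregate per hour (named hourly_summary so it does not mask base::summary)
hourly_summary <- ddply(df2, .(hour), summarize, a_mean = mean(a_count), a_sum = sum(a_count))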
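
For readers who prefer not to depend on plyr or lubridate, roughly the same per-hour aggregation can be sketched in base R with format() and tapply(). The data is rebuilt with seq(), tz = "UTC" is an assumption, and the name hourly is chosen here for illustration.

```r
# Rebuild the question's example: 19 rows at 10-minute steps (tz is an assumption).
start <- as.POSIXct("30/03/2011 05:09", format = "%d/%m/%Y %H:%M", tz = "UTC")
df <- data.frame(
  datetime = seq(start, by = "10 min", length.out = 19),
  a_count  = c(66L, 34L, 33L, 20L, 12L, 44L, 36L, 29L, 21L, 22L,
               17L, 38L, 24L, 19L, 60L, 54L, 27L, 36L, 45L)
)

# Hour of day as an integer grouping variable.
df$hour <- as.integer(format(df$datetime, "%H"))

# tapply() returns one value per hour, ordered by the sorted unique hours.
hourly <- data.frame(
  hour   = sort(unique(df$hour)),
  a_mean = as.vector(tapply(df$a_count, df$hour, mean)),
  a_sum  = as.vector(tapply(df$a_count, df$hour, sum))
)
```

Note that this aggregates within each clock hour rather than subsampling, so it answers a slightly different question from the one asked.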
Licensed under: CC-BY-SA with attribution