Question

This question asks about aggregation by time period in R (what pandas calls resampling). The most useful answer uses the xts package to group by a given time period, applying some function such as sum() or mean().

One of the comments suggested there was something similar in lubridate, but didn't elaborate. Can someone provide an idiomatic example using lubridate? I've read through the lubridate vignette a couple times and can imagine some combination of lubridate and plyr, however I want to make sure there isn't an easier way that I'm missing.

To make the example more real, let's say I want the daily sum of bicycles traveling northbound from this dataset:

library(lubridate)
library(reshape2)

bikecounts <- read.csv(url("http://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD"), header=TRUE, stringsAsFactors=FALSE)
names(bikecounts) <- c("Date", "Northbound", "Southbound")

Data looks like this:

> head(bikecounts)
                    Date Northbound Southbound
1 10/02/2012 12:00:00 AM          0          0
2 10/02/2012 01:00:00 AM          0          0
3 10/02/2012 02:00:00 AM          0          0
4 10/02/2012 03:00:00 AM          0          0
5 10/02/2012 04:00:00 AM          0          0
6 10/02/2012 05:00:00 AM          0          0

Solution

I don't know why you'd use lubridate for this. If you're just looking for something less awesome than xts, you could try this:

tapply(bikecounts$Northbound, as.Date(bikecounts$Date, format="%m/%d/%Y"), sum)

Basically, you just need to split by Date, then apply a function.
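The same split-then-apply idea can be spelled out with split() and sapply(); this sketch is equivalent to the tapply() call above:

```r
# split the Northbound counts into one vector per calendar day,
# then sum each group (same result as the tapply() one-liner)
groups <- split(bikecounts$Northbound,
                as.Date(bikecounts$Date, format = "%m/%d/%Y"))
daily.sums <- sapply(groups, sum)
head(daily.sums)
```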


lubridate could be used to create a grouping factor for split-apply problems. So, for example, if you want the sum for each month (ignoring year):

tapply(bikecounts$Northbound, month(mdy_hms(bikecounts$Date)), sum)

But, it's just using wrappers for base R functions, and in the case of the OP, I think the base R function as.Date is the easiest (as evidenced by the fact that the other Answers also ignored your request to use lubridate ;-) ).
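For completeness, base R's aggregate() expresses the same daily sum without any packages (a sketch using the column names from the question):

```r
# as.Date() drops the time of day, giving one grouping value per calendar day
bikecounts$Day <- as.Date(bikecounts$Date, format = "%m/%d/%Y")
daily <- aggregate(cbind(Northbound, Southbound) ~ Day,
                   data = bikecounts, FUN = sum)
head(daily)
```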


Something that wasn't covered by the Answer to the other Question linked to in the OP is split.xts. period.apply splits an xts at endpoints and applies a function to each group. You can find endpoints that are useful for a given task with the endpoints function. For example, if you have an xts object, x, then endpoints(x, "months") would give you the row numbers that are the last row of each month. split.xts leverages that to split an xts object -- split(x, "months") would return a list of xts objects where each component was for a different month.

Although split.xts() and endpoints() are primarily intended for xts objects, they also work on some other objects, including plain time-based vectors. Even if you don't want to use xts objects, you may still find uses for endpoints() because of its convenience or its speed (it's implemented in C).

> split.xts(as.Date("1970-01-01") + 1:10, "weeks")
[[1]]
[1] "1970-01-02" "1970-01-03" "1970-01-04"

[[2]]
[1] "1970-01-05" "1970-01-06" "1970-01-07" "1970-01-08" "1970-01-09"
[6] "1970-01-10" "1970-01-11"

> endpoints(as.Date("1970-01-01") + 1:10, "weeks")
[1]  0  3 10

I think lubridate's best use in this problem is for parsing the "Date" strings into POSIXct objects. i.e. the mdy_hms function in this case.
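A quick illustration of what mdy_hms() does with one of the strings from this dataset (it returns POSIXct, in UTC by default):

```r
library(lubridate)

# mdy_hms() handles the month/day/year order and the 12-hour AM/PM clock
mdy_hms("10/02/2012 01:00:00 AM")
```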

Here's an xts solution that uses lubridate to parse the "Date" strings.

x <- xts(bikecounts[, -1], mdy_hms(bikecounts$Date))
period.apply(x, endpoints(x, "days"), sum)
apply.daily(x, sum) # identical to above

For this specific task, xts also has an optimized period.sum function (written in Fortran) that is very fast

period.sum(x, endpoints(x, "days"))
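If you want to check the speed claim yourself, a rough comparison on one column could look like this (a sketch, assuming x is the xts object built above):

```r
library(xts)

# precompute the daily endpoints once, then time both approaches;
# they should produce identical daily sums
ep <- endpoints(x, "days")
system.time(a <- period.apply(x[, "Northbound"], ep, sum))
system.time(b <- period.sum(x[, "Northbound"], ep))
all.equal(as.numeric(a), as.numeric(b))
```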

Other tips

Using ddply from plyr package:

library(plyr)
bikecounts$Date <- with(bikecounts, as.Date(Date, format = "%m/%d/%Y"))
x <- ddply(bikecounts, .(Date), summarise, sumnorth = sum(Northbound), sumsouth = sum(Southbound))


 > head(x)
        Date sumnorth sumsouth
1 2012-10-02     1165      773
2 2012-10-03     1761     1760
3 2012-10-04     1767     1708
4 2012-10-05     1590     1558
5 2012-10-06      926     1080
6 2012-10-07      951     1191


 > tail(x)
          Date sumnorth sumsouth
298 2013-07-26     1964     1999
299 2013-07-27     1212     1289
300 2013-07-28      902     1078
301 2013-07-29     2040     2048
302 2013-07-30     2314     2226
303 2013-07-31     2008     2076

Here is an option using data.table after importing the csv:

library(data.table)

# convert the data.frame to data.table
bikecounts <- data.table(bikecounts)

# Calculate
bikecounts[, list(NB=sum(Northbound), SB=sum(Southbound)), by=as.Date(Date, format="%m/%d/%Y")]

        as.Date   NB   SB
  1: 2012-10-02 1165  773
  2: 2012-10-03 1761 1760
  3: 2012-10-04 1767 1708
  4: 2012-10-05 1590 1558
  5: 2012-10-06  926 1080
 ---                     
299: 2013-07-27 1212 1289
300: 2013-07-28  902 1078
301: 2013-07-29 2040 2048
302: 2013-07-30 2314 2226
303: 2013-07-31 2008 2076

Note, you can also use fread() ("fast read") from the data.table package to read the CSV into a data.table in one step. The only drawback is that you have to convert the date/time from a string manually.

e.g.:

bikecounts <- fread("http://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD", header=TRUE, stringsAsFactors=FALSE)
setnames(bikecounts, c("Date", "Northbound", "Southbound"))
bikecounts[, Date := as.POSIXct(Date, format="%m/%d/%Y %I:%M:%S %p")]
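With Date parsed as POSIXct, the daily sums from above can then be computed the same way (a sketch assuming the fread() steps succeeded):

```r
# as.Date() inside by= collapses each POSIXct timestamp to its calendar day
bikecounts[, list(NB = sum(Northbound), SB = sum(Southbound)),
           by = list(Day = as.Date(Date))]
```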

Here is the requested lubridate solution, which I also added to the linked question. It uses a combination of lubridate and zoo's aggregate() for these operations:

ts.month.sum <- aggregate(zoo.ts, month, sum)

ts.daily.mean <- aggregate(zoo.ts, day, mean)

ts.mins.mean <- aggregate(zoo.ts, minute, mean)

Obviously, you need to first convert your data to a zoo() object, which is easy enough. You can also use yearmon() or yearqtr(), or custom functions for both split and apply. This method is as syntactically sweet as that of pandas.
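A minimal sketch of that conversion step, assuming bikecounts and lubridate's mdy_hms() as above; note that grouping with as.Date gives true daily sums, whereas lubridate's day() would group by day-of-month across all months:

```r
library(zoo)
library(lubridate)

# a zoo series of Northbound counts indexed by the parsed timestamps
zoo.ts <- zoo(bikecounts$Northbound, mdy_hms(bikecounts$Date))

# daily sums: as.Date collapses each timestamp to its calendar day
daily <- aggregate(zoo.ts, as.Date, sum)
```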

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow