R: Using several criteria for the Aggregate function
-
11-12-2019
Question
I am looking for a way to use the aggregate function to sum up a column given several criteria in other columns: R should select a range of rows in one column and perform an operation on those rows, conditioned on the value in another column.
The practical problem I am trying to solve is the following: I have a list of electricity load measured every 15 minutes of the day, for every day over two years. It looks like this:
Date ______Time ______ Load
01-01-2010 00:00-00:15 1234
01-01-2010 00:15-00:30 2313
01-01-2010 ...
01-01-2010 23:30-23:45 2341
...
31-12-2011 23:30-23:45 2347
My aim is to compute the so-called "Peak Load" and "Off-Peak Load". Peak is from 8 am to 8 pm; Off-Peak is the rest of the day. So I want to calculate the Peak and Off-Peak load for every day: aggregate the load from 8:00 to 20:00, and separately the remaining load of the day.
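For reference, one way to express this in base R: assuming the columns Date, Time (strings like "00:00-00:15"), and Load from the sample above, a peak flag can be derived from each interval's start hour and then summed per day (a sketch on a tiny hand-made stand-in, not the real data):

```r
# Toy stand-in with the column layout from the question
loads <- data.frame(
  Date = rep("01-01-2010", 4),
  Time = c("07:45-08:00", "08:00-08:15", "19:45-20:00", "20:00-20:15"),
  Load = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)
# Start hour of each 15-minute interval, taken from the first two characters
start_hour <- as.integer(substr(loads$Time, 1, 2))
# Peak runs from 08:00 (inclusive) up to 20:00 (exclusive)
loads$peak <- start_hour >= 8 & start_hour < 20
# Daily peak and off-peak totals
aggregate(loads$Load, list(Date = loads$Date, Peak = loads$peak), sum)
```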
Thanks for your help!
best, F
Solution
I think your mental model of a hierarchy here is making this way too complicated. You don't have to subset by day and then by peak/off-peak. Just subset jointly.
Using ddply:
dat <- data.frame(date=rep(seq(5),5), time=runif(25), load=rnorm(25))
library(plyr)
dat$peak <- dat$time < .5  # stand-in for the 8:00-20:00 test
ddply(dat, .(date,peak), function(x) mean(x$load))
> ddply(dat, .(date,peak), function(x) mean(x$load) )
date peak V1
1 1 FALSE -1.064166845
2 1 TRUE 0.172868201
3 2 FALSE 0.638594830
4 2 TRUE 0.045538051
5 3 FALSE 0.201264770
6 3 TRUE 0.054019462
7 4 FALSE 0.722268759
8 4 TRUE -0.490305933
9 5 FALSE 0.003411591
10 5 TRUE 0.628566966
Using aggregate:
> aggregate(dat$load, list(dat$date,dat$peak), mean )
Group.1 Group.2 x
1 1 FALSE -1.064166845
2 2 FALSE 0.638594830
3 3 FALSE 0.201264770
4 4 FALSE 0.722268759
5 5 FALSE 0.003411591
6 1 TRUE 0.172868201
7 2 TRUE 0.045538051
8 3 TRUE 0.054019462
9 4 TRUE -0.490305933
10 5 TRUE 0.628566966
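Since the original aim is one number pair per day, the long output can also be reshaped to one row per date with separate off-peak and peak columns. A sketch with base reshape(), run on a small hand-made stand-in for the aggregated result (not the values above):

```r
# Hypothetical aggregated result: one row per (date, peak) pair
agg <- data.frame(
  date = rep(1:3, each = 2),
  peak = rep(c(FALSE, TRUE), 3),
  x    = c(10, 20, 30, 40, 50, 60)
)
# Spread the peak levels into columns x.FALSE (off-peak) and x.TRUE (peak)
wide <- reshape(agg, idvar = "date", timevar = "peak", direction = "wide")
wide
```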
And just for the fun of it, benchmarks.
First, using 5x5 entries as above:
> microbenchmark(
+ ddply(dat, .(date,peak), function(x) mean(x$load) ),
+ aggregate(dat$load, list(dat$date,dat$peak), mean )
+ )
Unit: milliseconds
expr min lq median uq max
1 aggregate(dat$load, list(dat$date, dat$peak), mean) 1.323438 1.376635 1.445769 1.549663 2.853348
2 ddply(dat, .(date, peak), function(x) mean(x$load)) 4.057177 4.292442 4.386289 4.534728 6.864962
Next, using 500x500 entries:
> m
Unit: milliseconds
expr min lq median uq max
1 aggregate(dat$load, list(dat$date, dat$peak), mean) 558.9524 570.7354 590.4633 599.4404 634.3201
2 ddply(dat, .(date, peak), function(x) mean(x$load)) 317.7781 348.1116 361.7118 413.4490 503.8540
Benchmark code (shown here for the 50x50 case; change n for the other sizes):
n <- 50
dat <- data.frame(date=rep(seq(n),n),time=runif(n),load=rnorm(n))
dat$peak <- dat$time<.5
library(plyr)
library(microbenchmark)
library(data.table)
DT <- as.data.table(dat)
m <- microbenchmark(
ddply(dat, .(date,peak), function(x) mean(x$load) ),
aggregate(dat$load, list(dat$date,dat$peak), mean ),
  DT[,.Internal(mean(load)),keyby=list(date,peak)]  # .Internal() skips S3 dispatch; plain mean(load) also works
)
m
plot(m)
So aggregate is faster for small problems (presumably because it has less overhead to load up all the machinery), and ddply is faster for large problems (where speed matters). data.table blows everything away (as usual).