Find maximum from combination of two tables (for-loop too slow)
02-06-2021
Question
I have a data table "the.data", where the first column indicates the measurement instrument, and the rest hold different measured data.
instrument <- c(1,2,3,4,5,1,2,3,4,5)
hour <- c(1,1,1,1,1,2,2,2,2,2)
da <- c(12,14,11,14,10,19,15,16,13,11)
db <- c(21,23,22,29,28,26,24,27,26,22)
the.data <- data.frame(instrument,hour,da,db)
I also have defined groups of instruments, where for example group 1 (g1) refers to instruments 1 and 2.
g1 <- c(1,2)
g2 <- c(4,3,1)
g3 <- c(1,5,2)
g4 <- c(2,4)
g5 <- c(5,3,1,2,6)
groups <- c("g1","g2","g3","g4","g5")
I need to find, for each group and each data type, the hour at which the group's sum is at its maximum, along with that sum.
g1 hour 1: sum(da)=12+14=26
g1 hour 2: sum(da)=19+15=34
So, for g1 and da the answer is hour 2 and value 34.
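The worked example above can be reproduced in a few lines of base R. This is only a sketch of the calculation for one group and one column; the names g1.rows, g1.da, best.hour, and best.sum are illustrative and not from the question.

```r
# Sample data from the question
the.data <- data.frame(
  instrument = c(1,2,3,4,5,1,2,3,4,5),
  hour       = c(1,1,1,1,1,2,2,2,2,2),
  da         = c(12,14,11,14,10,19,15,16,13,11),
  db         = c(21,23,22,29,28,26,24,27,26,22)
)
g1 <- c(1, 2)

# Keep only the rows for instruments in g1, then sum da within each hour
g1.rows <- the.data[the.data$instrument %in% g1, ]
g1.da   <- tapply(g1.rows$da, g1.rows$hour, sum)

# The hour with the largest sum, and the sum itself
best.hour <- as.numeric(names(g1.da)[which.max(g1.da)])
best.sum  <- max(g1.da)
```

Here g1.da holds 26 for hour 1 and 34 for hour 2, so best.hour is 2 and best.sum is 34.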
I did this with a for-loop nested within a for-loop, but it takes too long (I interrupted it after a few hours). The issue is that the.data is about 100,000 rows long and that there are about 5,000 groups with 2-50 instruments each.
What would be a good method to do this?
Sincere thanks to all contributors to Stack Overflow.
Update: Now only five groups in examples.
/Chris
Solution
The group loop will have to stay, or at best be replaced by something like lapply(). The hour loop, however, can be totally replaced by reformatting to an instrument x hour matrix and then just doing vectorized algebra. For example:
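library(reshape2)
groups = list(g1, g3)
# instrument x hour matrix of da values (one row per instrument)
the.data.a = dcast(the.data[,1:3], instrument ~ hour, value.var = "da")
> sapply(groups, function(x) {
    # select rows by instrument id rather than by row position
    byHour = colSums(the.data.a[the.data.a$instrument %in% x, -1])
    data.frame(max = max(byHour), ind = which.max(byHour))
  })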
[,1] [,2]
max 34 45
ind 2 2
OTHER TIPS
Here's a slightly modified version of John Colby's answer, with some sample data.
set.seed(21)
instrument <- sample(100, 1e5, TRUE)
hour <- sample(24, 1e5, TRUE)
da <- trunc(runif(1e5)*10)
db <- trunc(runif(1e5)*10)
the.data <- data.frame(instrument,hour,da,db)
groups <- replicate(5000, sample(100, sample(50,1)))
names(groups) <- paste("g",1:length(groups),sep="")
library(reshape2)
system.time({
  the.data.a <- dcast(the.data[,1:3], instrument ~ hour, sum)
  out <- t(sapply(groups, function(i) {
    byHour <- colSums(the.data.a[i,-1])
    c(max(byHour), which.max(byHour))
  }))
  # column order matches the vector returned above: sum first, then hour
  colnames(out) <- c("max.sum","max.hour")
})
# Using da as value column: use value.var to override.
# user system elapsed
# 3.80 0.00 3.81
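The remaining per-group colSums() call can itself be vectorized. The following is a minimal sketch, not part of the original answers: it encodes group membership as a 0/1 matrix (groups x instruments) and multiplies it by the instrument x hour matrix, so all group sums come out of one matrix product. The names ih, member, and sums are illustrative, the data is a smaller random sample than above, and base R's tapply() stands in for dcast().

```r
set.seed(21)
n.inst <- 100
the.data <- data.frame(instrument = sample(n.inst, 1e4, TRUE),
                       hour       = sample(24, 1e4, TRUE),
                       da         = trunc(runif(1e4) * 10))
groups <- replicate(500, sample(n.inst, sample(2:50, 1)), simplify = FALSE)
names(groups) <- paste0("g", seq_along(groups))

# instrument x hour matrix of summed da; empty cells become 0
ih <- tapply(the.data$da, list(the.data$instrument, the.data$hour), sum)
ih[is.na(ih)] <- 0

# groups x instruments membership matrix (1 if the instrument is in the group)
member <- t(sapply(groups, function(g) as.numeric(rownames(ih) %in% g)))

# groups x hours matrix of sums, from a single matrix product
sums <- member %*% ih

out <- data.frame(
  max.sum  = apply(sums, 1, max),
  max.hour = as.numeric(colnames(ih))[max.col(sums, ties.method = "first")]
)
head(out)
```

The membership matrix has one cell per group and instrument, so it stays small when instruments number in the hundreds; with many more instruments a sparse matrix might be worth considering.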
Here's one approach using plyr and reshape2 from Hadley. First, we'll add some boolean values to the.data depending on whether or not the instrument is in that group. Then we'll melt it into long format, subset out the rows we don't need, and then do a group by operation with ddply or data.table.
#add boolean columns
the.data <- transform(the.data,
g1 = instrument %in% g1,
g2 = instrument %in% g2,
g3 = instrument %in% g3,
g4 = instrument %in% g4,
g5 = instrument %in% g5
)
#load library
library(reshape2)
#melt into long format
the.data.m <- melt(the.data, id.vars = 1:4)
#subset out data that has FALSE for the groupings
the.data.m <- subset(the.data.m, value == TRUE)
#load plyr and data.table
library(plyr)
library(data.table)
#plyr way
ddply(the.data.m, c("variable", "hour"), summarize, out = sum(da))
#data.table way
dt <- data.table(the.data.m)
dt[, list(out = sum(da)), by = "variable, hour"]
Do some benchmarking to see which is faster:
library(rbenchmark)
f1 <- function() ddply(the.data.m, c("variable", "hour"), summarize, out = sum(da))
f2 <- function() dt[, list(out = sum(da)), by = "variable, hour"]
> benchmark(f1(), f2(), replications=1000, order="elapsed", columns = c("test", "elapsed", "relative"))
test elapsed relative
2 f2() 3.44 1.000000
1 f1() 6.82 1.982558
So, data.table is about 2x faster for this example. Your mileage may vary.
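And just to show that it's giving the right values (the g2 sums shown, 25 and 29, correspond to g2 containing only instruments 4 and 3, so the output appears to predate the group definitions listed in the question):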
> dt[, list(out = sum(da)), by = "variable, hour"]
variable hour out
[1,] g1 1 26
[2,] g1 2 34
[3,] g2 1 25
[4,] g2 2 29
...
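With 5,000 groups, adding one boolean column per group by hand is impractical. As a rough base-R analogue of the long-format idea above (a sketch under that assumption, not the answerer's code), group membership can be stored as a long table and joined to the data. The names membership, joined, sums, and best.da are illustrative.

```r
the.data <- data.frame(
  instrument = c(1,2,3,4,5,1,2,3,4,5),
  hour       = c(1,1,1,1,1,2,2,2,2,2),
  da         = c(12,14,11,14,10,19,15,16,13,11),
  db         = c(21,23,22,29,28,26,24,27,26,22)
)
groups <- list(g1 = c(1,2), g2 = c(4,3,1), g3 = c(1,5,2),
               g4 = c(2,4), g5 = c(5,3,1,2,6))

# Long membership table: one row per (group, instrument) pair
membership <- data.frame(
  group      = rep(names(groups), lengths(groups)),
  instrument = unlist(groups, use.names = FALSE)
)

# Inner join; instrument 6 has no measurements, so it drops out
joined <- merge(membership, the.data, by = "instrument")

# Sum both measurement columns per group and hour
sums <- aggregate(cbind(da, db) ~ group + hour, data = joined, FUN = sum)

# For each group, keep the hour with the largest da sum
# (which.max returns the first maximum if hours tie)
best.da <- do.call(rbind, lapply(split(sums, sums$group), function(s) {
  s[which.max(s$da), c("group", "hour", "da")]
}))
best.da
```

The same final step with s$db in place of s$da gives the db results.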
You didn't provide your code (or a programmatic way to generate the groups, which would seem to be needed with a group count of 5000) but this may be a more effective use of R:
groups <- list(g1,g2,g3,g4,g5)
gmax <- list()
# The "da" results
for (gitem in seq_along(groups)) {
  gmax[[gitem]] <- with(subset(the.data, instrument %in% groups[[gitem]]),
                        tapply(da, hour, sum))
}
damat <- matrix(c(sapply(gmax, which.max),
                  sapply(gmax, max)), ncol = 2)
# The "db" results
for (gitem in seq_along(groups)) {
  gmax[[gitem]] <- with(subset(the.data, instrument %in% groups[[gitem]]),
                        tapply(db, hour, sum))
}
dbmat <- matrix(c(sapply(gmax, which.max),
                  sapply(gmax, max)), ncol = 2)
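#-------- (the values below are consistent with g2 = c(4,3), g4 = c(4), g5 = c(5,3,2), so they appear to predate the group definitions listed in the question)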
> damat
[,1] [,2]
[1,] 2 34
[2,] 2 29
[3,] 2 45
[4,] 1 14
[5,] 2 42
> dbmat
[,1] [,2]
[1,] 2 50
[2,] 2 53
[3,] 1 72
[4,] 1 29
[5,] 1 73
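The two near-identical loops above could be folded into one helper that takes the value column as an argument. This is a sketch only; max.per.group is a hypothetical name, and the groups are the ones listed in the question, so the numbers differ from the output printed above for g2, g4, and g5.

```r
the.data <- data.frame(
  instrument = c(1,2,3,4,5,1,2,3,4,5),
  hour       = c(1,1,1,1,1,2,2,2,2,2),
  da         = c(12,14,11,14,10,19,15,16,13,11),
  db         = c(21,23,22,29,28,26,24,27,26,22)
)
groups <- list(g1 = c(1,2), g2 = c(4,3,1), g3 = c(1,5,2),
               g4 = c(2,4), g5 = c(5,3,1,2,6))

# For one value column, return the max hourly sum and its hour for every group
max.per.group <- function(data, groups, value) {
  t(sapply(groups, function(g) {
    rows    <- data$instrument %in% g
    by.hour <- tapply(data[[value]][rows], data$hour[rows], sum)
    c(max.hour = as.numeric(names(by.hour)[which.max(by.hour)]),
      max.sum  = max(by.hour))
  }))
}

# One result matrix per measurement column
results <- lapply(c(da = "da", db = "db"),
                  function(v) max.per.group(the.data, groups, v))
results$da
results$db
```

Each element of results is a matrix with one row per group, so further measurement columns only need to be added to the vector passed to lapply().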