What is the most efficient way of processing a flow data.frame like
> df <- data.frame(amount=c(4,3,1,1,4,5,9,13,1,1), size=c(164,124,131,315,1128,331,1135,13589,164,68), tot=1, first=c(1,1,3,3,2,2,2,2,4,4), secs=c(2,2,0,0,1,1,1,1,0,0))
> df
amount size tot first secs
1 4 164 1 1 2
2 3 124 1 1 2
3 1 131 1 3 0
4 1 315 1 3 0
5 4 1128 1 2 1
6 5 331 1 2 1
7 9 1135 1 2 1
8 13 13589 1 2 1
9 1 164 1 4 0
10 1 68 1 4 0
to an per-time aggregated totals
> df2
time tot amount size
1 1 2 3.5 144
2 2 6 34.5 16327
3 3 8 36.5 16773
4 4 2 2.0 232
.. using R, when the actual data-set can be more than 100 000 000 rows or even tens of gigabytes?
Column first
denotes the start of a flow with duration secs
, with metrics amount
, size
, and tot
. In aggregated totals the size
and amount
are evenly divided to the time range in double-precision, whereas tot
is summed to every time-slot as an integer. Duration secs
denotes how many time-slots the flows last in addition to value first
: If secs
is 1 and first
is 5, the flow lasts time-slots 5 and 6. My current implementation uses ugly and dead-slow for-loops, which is not an option:
df2 = data.frame()
for (i in 1:nrow(df)) {
items <- df[i, 'secs']
idd <- df[i, 'first']
for (ss in 0:items) { # run once for secs=0
if (items == 0) { items <- 1 }
df2[idd+ss, 'time'] <- idd+ss
if (is.null(df2[idd+ss, 'tot']) || is.na(df2[idd+ss, 'tot'])) {
df2[idd+ss, 'tot'] <- df[i, 'tot']
} else {
df2[idd+ss, 'tot'] <- df2[idd+ss, 'tot'] + df[i, 'tot']
}
if (is.null(df2[idd+ss, 'amount']) || is.na(df2[idd+ss, 'amount'])) {
df2[idd+ss, 'amount'] <- df[i, 'amount']/items
} else {
df2[idd+ss, 'amount'] <- df2[idd+ss, 'amount'] + df[i, 'amount']/items
}
if (is.null(df2[idd+ss, 'size']) || is.na(df2[idd+ss, 'size'])) {
df2[idd+ss, 'size'] <- df[i, 'size']/items
} else {
df2[idd+ss, 'size'] <- df2[idd+ss, 'size'] + df[i, 'size']/items
}
}
}
You can probably optimize this a lot and achieve good performance using only loops, but I bet that better algorithms exist. Maybe you could somehow expand/duplicate the rows with secs > 0
, while increasing the first
(timestamp) values of the expanded rows and adjust amount
, size
, and tot
metrics on the fly:
now original data..
amount size tot first secs
1 4 164 1 1 0
2 4 164 1 1 1
3 3 124 1 1 2
magically becomes
amount size tot first
1 4 164 1 1
2 2 82 1 1
3 2 82 1 2
4 1 41.33 1 1
5 1 41.33 1 2
6 1 41.33 1 3
After this pre-processing step aggregation would be trivial using plyr ddply, of course in efficient parallel mode.
All example ddply, apply etc. function examples I was able to find operate on per-row or per-column basis, making it hard to modify other rows. Hopefully I don't have to rely on awk-magic.
Update: The mentioned algorithm can easily exhaust your memory when the expansion is done "as is". Some kind of "on the fly" calculation is thus preferred, where we don't map everything to memory. Mattrition's answer is however correct and helped a lot, so marking it as the accepted answer.