R data.frame flow data pre-processing for aggregated time statistics

https://stackoverflow.com/questions/23677129

23-07-2023

Question

What is the most efficient way of processing a flow data.frame like

> df <- data.frame(amount=c(4,3,1,1,4,5,9,13,1,1), size=c(164,124,131,315,1128,331,1135,13589,164,68), tot=1, first=c(1,1,3,3,2,2,2,2,4,4), secs=c(2,2,0,0,1,1,1,1,0,0))
> df
  amount  size   tot first secs
1      4   164     1     1    2
2      3   124     1     1    2
3      1   131     1     3    0
4      1   315     1     3    0
5      4  1128     1     2    1
6      5   331     1     2    1
7      9  1135     1     2    1
8     13 13589     1     2    1
9      1   164     1     4    0
10     1    68     1     4    0

into per-time aggregated totals

> df2
  time tot amount  size
1    1   2    3.5   144
2    2   6   34.5 16327
3    3   8   36.5 16773
4    4   2    2.0   232

using R, when the actual data set can have more than 100 000 000 rows or even take up tens of gigabytes?

Column first denotes the time-slot in which a flow starts and secs its duration; amount, size, and tot are the metrics. In the aggregated totals, size and amount are divided evenly over the flow's time range as double-precision values, whereas tot is added to every time-slot as an integer. The duration secs gives the number of time-slots a flow lasts in addition to first: if secs is 1 and first is 5, the flow covers time-slots 5 and 6. My current implementation uses ugly, dead-slow for-loops, which is not an option at this scale:

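# Pre-declare the result columns so that reading df2[row, 'tot'] etc. is valid before the first write
df2 <- data.frame(time=numeric(), tot=numeric(), amount=numeric(), size=numeric())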
for (i in 1:nrow(df)) {

  items <- df[i, 'secs']
  idd <- df[i, 'first']

  for (ss in 0:items) {  # run once for secs=0
    if (items == 0) { items <- 1 }

    df2[idd+ss, 'time'] <- idd+ss

    if (is.null(df2[idd+ss, 'tot']) || is.na(df2[idd+ss, 'tot'])) {
      df2[idd+ss, 'tot'] <- df[i, 'tot']
    } else {
      df2[idd+ss, 'tot'] <- df2[idd+ss, 'tot'] + df[i, 'tot']
    }

    if (is.null(df2[idd+ss, 'amount']) || is.na(df2[idd+ss, 'amount'])) {
      df2[idd+ss, 'amount'] <- df[i, 'amount']/items
    } else {
      df2[idd+ss, 'amount'] <- df2[idd+ss, 'amount'] + df[i, 'amount']/items
    }

    if (is.null(df2[idd+ss, 'size']) || is.na(df2[idd+ss, 'size'])) {
      df2[idd+ss, 'size'] <- df[i, 'size']/items
    } else {
      df2[idd+ss, 'size'] <- df2[idd+ss, 'size'] + df[i, 'size']/items
    }

  }
}

You can probably optimize this a lot and achieve good performance using only loops, but I bet that better algorithms exist. Maybe you could somehow expand/duplicate the rows with secs > 0, increasing the first (timestamp) values of the expanded rows and adjusting the amount, size, and tot metrics on the fly:

Now the original data..

  amount  size   tot first secs
1      4   164     1     1    0
2      4   164     1     1    1
3      3   124     1     1    2


magically becomes

  amount  size   tot first
1      4   164     1     1
2      2    82     1     1
3      2    82     1     2
4      1 41.33     1     1
5      1 41.33     1     2
6      1 41.33     1     3

After this pre-processing step, aggregation would be trivial using plyr ddply, of course in efficient parallel mode.

All the ddply, apply etc. examples I was able to find operate on a per-row or per-column basis, making it hard to modify other rows. Hopefully I don't have to rely on awk-magic.
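The expand-then-aggregate idea can be sketched in base R without any per-row loop. This is only an illustration, not the asker's or the accepted answer's code: the helper names (n_slots, idx, offset, divisor, expanded, agg) are made up here, and the divisor follows the loop above (secs, or 1 when secs is 0), so the result reproduces the df2 totals. Using secs + 1 as the divisor instead would give the even split shown in the small expanded table.

```r
# Sample data from the question
df <- data.frame(amount = c(4, 3, 1, 1, 4, 5, 9, 13, 1, 1),
                 size   = c(164, 124, 131, 315, 1128, 331, 1135, 13589, 164, 68),
                 tot    = 1,
                 first  = c(1, 1, 3, 3, 2, 2, 2, 2, 4, 4),
                 secs   = c(2, 2, 0, 0, 1, 1, 1, 1, 0, 0))

# A flow covers secs + 1 time-slots, so it needs that many output rows
n_slots <- df$secs + 1
idx     <- rep(seq_len(nrow(df)), times = n_slots)  # source row of each expanded row
offset  <- sequence(n_slots) - 1                    # 0, 1, ..., secs within each flow

# Assumption: same divisor as the loop in the question (secs, or 1 when secs is 0)
divisor <- pmax(df$secs, 1)

expanded <- data.frame(
  time   = df$first[idx] + offset,
  tot    = df$tot[idx],
  amount = (df$amount / divisor)[idx],
  size   = (df$size / divisor)[idx]
)

# rowsum() sums every column within each time-slot; groups come back sorted
agg <- rowsum(expanded[, c("tot", "amount", "size")], group = expanded$time)
df2 <- data.frame(time   = as.numeric(rownames(agg)),
                  tot    = agg$tot,
                  amount = agg$amount,
                  size   = agg$size)
print(df2)
```

Note that expanded has sum(secs + 1) rows, so on the full data set this intermediate table can be many times larger than the input.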

Update: The mentioned algorithm can easily exhaust your memory when the expansion is done "as is". Some kind of "on the fly" calculation is thus preferred, where we don't map everything to memory. Mattrition's answer is however correct and helped a lot, so marking it as the accepted answer.
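One way to avoid the expansion altogether is a difference-array (prefix-sum) approach, sketched below in base R. This is a different technique from the accepted answer, offered only as a sketch under assumptions: time-slots are integer-valued, the divisor is the one used by the loop in the question (secs, or 1 when secs is 0), and all names (per_slot, start, end, delta, result) are made up for this illustration. Each flow adds its per-slot share at its first slot and subtracts it again at the slot after its last one; a cumulative sum over the slots then yields the totals.

```r
# Sample data from the question
df <- data.frame(amount = c(4, 3, 1, 1, 4, 5, 9, 13, 1, 1),
                 size   = c(164, 124, 131, 315, 1128, 331, 1135, 13589, 164, 68),
                 tot    = 1,
                 first  = c(1, 1, 3, 3, 2, 2, 2, 2, 4, 4),
                 secs   = c(2, 2, 0, 0, 1, 1, 1, 1, 0, 0))

# Assumption: same divisor as the loop in the question
divisor  <- pmax(df$secs, 1)
per_slot <- cbind(tot    = df$tot,
                  amount = df$amount / divisor,
                  size   = df$size / divisor)

start <- df$first                 # first slot a flow contributes to
end   <- df$first + df$secs + 1   # first slot it no longer contributes to

t_min  <- min(start)
n_time <- max(end) - t_min + 1    # one extra trailing slot for the last "end"

# delta[t, ] = shares that start at slot t minus shares that stop at slot t
delta <- matrix(0, nrow = n_time, ncol = ncol(per_slot),
                dimnames = list(NULL, colnames(per_slot)))
starts <- rowsum(per_slot, group = start)
ends   <- rowsum(per_slot, group = end)
s_idx  <- as.integer(rownames(starts)) - t_min + 1
e_idx  <- as.integer(rownames(ends)) - t_min + 1
delta[s_idx, ] <- delta[s_idx, , drop = FALSE] + starts
delta[e_idx, ] <- delta[e_idx, , drop = FALSE] - ends

# Running sum over slots; the trailing slot only closes the last flows
totals <- apply(delta, 2, cumsum)[-n_time, , drop = FALSE]
result <- data.frame(time   = t_min + seq_len(n_time - 1) - 1,
                     tot    = totals[, "tot"],
                     amount = totals[, "amount"],
                     size   = totals[, "size"])
print(result)
```

Memory use here grows with the number of input rows plus the number of time-slots, not with sum(secs + 1). Because delta is purely additive, it could in principle be accumulated chunk by chunk while reading a large file. Two caveats: time-slots without any flow show up as zero rows, and repeated floating-point additions and subtractions may leave tiny rounding residues.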

Solution

The following is an implementation using data.table. I chose data.table for its aggregation abilities, but it's a nifty and efficient class to work with too.

library(data.table)

dt <- as.data.table(df)

# Using the "expand" solution linked in the Q. 
# +1 to secs to allow room for 0-values
dtr <- dt[rep(seq.int(1, nrow(dt)), secs+1)] 

# Create a new seci column that enumerates sec for each row of dt
dtr[,seci := dt[,seq(0,secs),by=1:nrow(dt)][,V1]]

# All secs that equal 0 are changed to 1 for later division
dtr[secs==0, secs := 1]

# Create time (first+seci) and adjusted amount and size columns
dtr[,c("time", "amount2", "size2") := list(first+seci, amount/secs, size/secs)]

# Aggregate selected columns (tot, amount2, and size2) by time
dtr.a <- dtr[,list(tot=sum(tot), amount=sum(amount2), size=sum(size2)), by=time]


dtr.a
   time tot amount  size
1:    1   2    3.5   144
2:    2   6   34.5 16327
3:    3   8   36.5 16773
4:    4   2    2.0   232

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow