R - rolling window over data.table

https://stackoverflow.com/questions/23597735

r
data.table

20-07-2023

Question

I have the following data.table:

          time       id type   price      size  api start.point  end.point
 1: 1399672906 37119594  ASK 440.002 1.4840000 TRUE  1399672606 1399672906
 2: 1399672940 37119597  BID 441.000 0.1758830 TRUE  1399672640 1399672940
 3: 1399672940 37119598  BID 441.000 0.0491166 TRUE  1399672640 1399672940
 4: 1399673105 37119638  ASK 440.002 0.1313700 TRUE  1399672805 1399673105
 5: 1399673198 37119668  BID 441.000 0.0233013 TRUE  1399672898 1399673198
 6: 1399673198 37119669  BID 441.000 0.9744230 TRUE  1399672898 1399673198
 7: 1399673208 37119675  BID 441.000 0.1587060 TRUE  1399672908 1399673208
 8: 1399673208 37119676  BID 441.000 0.1238870 TRUE  1399672908 1399673208
 9: 1399673208 37119677  BID 441.001 0.0100000 TRUE  1399672908 1399673208
10: 1399673208 37119678  BID 441.175 0.0129740 TRUE  1399672908 1399673208
11: 1399673208 37119679  BID 441.192 0.0100000 TRUE  1399672908 1399673208
12: 1399673208 37119680  BID 441.399 0.0129740 TRUE  1399672908 1399673208
13: 1399673208 37119681  BID 441.499 1.7500000 TRUE  1399672908 1399673208
14: 1399673208 37119682  BID 441.500 8.0214600 TRUE  1399672908 1399673208
15: 1399673241 37119691  BID 441.500 0.0453001 TRUE  1399672941 1399673241
16: 1399673274 37119696  ASK 440.030 0.9133460 TRUE  1399672974 1399673274
17: 1399673360 37119705  BID 440.030 0.0580000 TRUE  1399673060 1399673360
18: 1399673433 37119709  ASK 440.002 0.0319611 TRUE  1399673133 1399673433
19: 1399673506 37119711  ASK 440.002 0.2618460 TRUE  1399673206 1399673506
20: 1399673507 37119712  BID 440.002 1.0000000 TRUE  1399673207 1399673507

where:

time is a Unix timestamp
id is the trade number assigned by the exchange
start.point is "time" less 5 minutes (300 seconds)
end.point is equal to "time"

The series is not equidistant. The variables start.point and end.point define a 5-minute moving window ending at "time", and I want to calculate the frequency of trades (the number of distinct trade ids) within each row's window.

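I have done it with a for loop:

for (i in 1:nrow(trades)){
  # count the distinct trade ids whose time falls inside row i's window
  trades[i, freq := length(unique(trades[time >= start.point[i] & time <= end.point[i]]$id))]
  # status.bar is a txtProgressBar created beforehand (not shown)
  setTxtProgressBar(status.bar, i)
}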

However, I'm wondering if there is some more "fashionable" data.table way. I tried something like:

trades[, freq := list(length(unique(trades[time >= start.point & time <= end.point,]$id))), by = list(id)]

But the results are wrong; it does not seem to work on a row-by-row basis:

            time       id type   price       size  api start.point  end.point freq
  1: 1399672906 37119594  ASK 440.002  1.4840000 TRUE  1399672606 1399672906  100
  2: 1399672940 37119597  BID 441.000  0.1758830 TRUE  1399672640 1399672940  100
  3: 1399672940 37119598  BID 441.000  0.0491166 TRUE  1399672640 1399672940  100
  4: 1399673105 37119638  ASK 440.002  0.1313700 TRUE  1399672805 1399673105  100
  5: 1399673198 37119668  BID 441.000  0.0233013 TRUE  1399672898 1399673198  100
  6: 1399673198 37119669  BID 441.000  0.9744230 TRUE  1399672898 1399673198  100
  7: 1399673208 37119675  BID 441.000  0.1587060 TRUE  1399672908 1399673208  100
  8: 1399673208 37119676  BID 441.000  0.1238870 TRUE  1399672908 1399673208  100
  9: 1399673208 37119677  BID 441.001  0.0100000 TRUE  1399672908 1399673208  100
 10: 1399673208 37119678  BID 441.175  0.0129740 TRUE  1399672908 1399673208  100
 11: 1399673208 37119679  BID 441.192  0.0100000 TRUE  1399672908 1399673208  100

UPDATE:

see the structure below:

structure(list(time = c(1399672906L, 1399673105L, 1399673274L, 
1399673433L, 1399673506L, 1399673531L), id = c(37119594L, 37119638L, 
37119696L, 37119709L, 37119711L, 37119717L), type = c("ASK", 
"ASK", "ASK", "ASK", "ASK", "ASK"), price = c(440.002, 440.002, 
440.03, 440.002, 440.002, 440), size = c(1.484, 0.13137, 0.913346, 
0.0319611, 0.261846, 3.168), api = c(TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE), start.point = c(1399672606, 1399672805, 1399672974, 
1399673133, 1399673206, 1399673231), end.point = c(1399672906L, 
1399673105L, 1399673274L, 1399673433L, 1399673506L, 1399673531L
), freq = c(1L, 4L, 13L, 14L, 13L, 11L)), .Names = c("time", 
"id", "type", "price", "size", "api", "start.point", "end.point", 
"freq"), sorted = c("type", "time"), class = c("data.table", 
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000002e50788>)
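The dput() output above cannot be pasted back into R as-is, because the `.internal.selfref = <pointer: ...>` entry is not valid R syntax. A minimal sketch for rebuilding the same six rows is given here; it assumes the window is always 300 seconds, as described in the question, and deliberately leaves out the freq column (those values were computed on the full data set, not on this sample).

```r
library(data.table)

# Sample rows copied from the dput() output; freq and the
# .internal.selfref pointer are left out on purpose.
trades <- data.table(
  time  = c(1399672906L, 1399673105L, 1399673274L,
            1399673433L, 1399673506L, 1399673531L),
  id    = c(37119594L, 37119638L, 37119696L,
            37119709L, 37119711L, 37119717L),
  type  = "ASK",
  price = c(440.002, 440.002, 440.03, 440.002, 440.002, 440),
  size  = c(1.484, 0.13137, 0.913346, 0.0319611, 0.261846, 3.168),
  api   = TRUE
)

# 5-minute (300 s) window ending at each trade's own timestamp
trades[, `:=`(start.point = time - 300, end.point = time)]

# same key as recorded in the "sorted" attribute of the dput() output
setkey(trades, type, time)
```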

Solution

I think this is best accomplished with the Bioconductor package IRanges for now, until interval joins / range joins are implemented in data.table.

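require(data.table)
require(IRanges)
# each trade time as a width-1 range (a single point)
ir1 = IRanges(trades$time, width=1L)
# each row's 5-minute window
ir2 = IRanges(trades$start.point, trades$end.point)

# all (point, window) pairs where the point lies within the window
olaps = findOverlaps(ir1, ir2, type = "within")
# count points per window; V2 holds the window's row index
dt = data.table(queryHits(olaps), subjectHits(olaps))[, .N, by=V2]

# write the counts back by row index
trades[dt$V2, freq := dt$N]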

#          time       id type   price      size  api start.point  end.point freq
# 1: 1399672906 37119594  ASK 440.002 1.4840000 TRUE  1399672606 1399672906    1
# 2: 1399673105 37119638  ASK 440.002 0.1313700 TRUE  1399672805 1399673105    2
# 3: 1399673274 37119696  ASK 440.030 0.9133460 TRUE  1399672974 1399673274    2
# 4: 1399673433 37119709  ASK 440.002 0.0319611 TRUE  1399673133 1399673433    2
# 5: 1399673506 37119711  ASK 440.002 0.2618460 TRUE  1399673206 1399673506    3
# 6: 1399673531 37119717  ASK 440.000 3.1680000 TRUE  1399673231 1399673531    4
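Later data.table releases added overlap joins (foverlaps) and non-equi joins, so the same count can also be computed without IRanges. The following is only a sketch of that alternative, not the method used above: it rebuilds just the time and id columns of the sample, assumes a 300-second window, and uses a hypothetical helper table named counts.

```r
library(data.table)

# minimal copy of the sample: only the columns the join needs
trades <- data.table(
  time = c(1399672906L, 1399673105L, 1399673274L,
           1399673433L, 1399673506L, 1399673531L),
  id   = c(37119594L, 37119638L, 37119696L,
           37119709L, 37119711L, 37119717L)
)
trades[, `:=`(start.point = time - 300L, end.point = time)]

# Non-equi self-join: for every row of the inner trades (the window),
# find the rows of the outer trades whose time lies in
# [start.point, end.point] and count their distinct ids.
# by = .EACHI evaluates j once per window row, in row order.
counts <- trades[trades,
                 on = .(time >= start.point, time <= end.point),
                 .(freq = uniqueN(id)),
                 by = .EACHI]

# counts has one row per window, in the same order as trades
trades[, freq := counts$freq]
```

On the six sample rows this gives the same freq values as the IRanges result shown above.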

HTH

License: CC-BY-SA with attribution

Not affiliated with StackOverflow