extracting segments of a data.table

https://stackoverflow.com/questions/12237018

29-06-2021

Question

I have a data.table and I need to extract equal length segments starting at various row locations. What is the easiest way to do this? For example:

x <- data.table(a=sample(1:1000,100), b=sample(1:1000,100))
r <- c(1,2,10,20,44)
idx <- lapply(r, function(i) {j <-which(x$a == i); if (length(j)>0) {return(j)} })
y <- lapply(idx, function(i) {if (!is.null(i)) x[i:(i+5)]})
do.call(rbind, y)
    a   b
1:  44  63
2:  96 730
3: 901 617
4: 446 370
5: 195 341
6: 298 411

This is certainly not the data.table way of doing things, so I was hoping there is a better way.

EDIT: Per the comments below, I have edited this to make clear that the values in a are not necessarily contiguous, nor do they correspond to row numbers.

Solution

Not sure whether you already know the row positions, or if you want to search for them. Either way, this should cover both.

require(data.table)
set.seed(1)
DT = data.table(a=sample(1:1000,20), b=sample(1:1000,20))
setkey(DT,a)
DT
#       a   b
#  1:  62 338
#  2: 175 593
#  3: 201 267
#  4: 204 478
#  5: 266 935
#  6: 372 212
#  7: 374 711
#  8: 380 184
#  9: 491 659
# 10: 572 651
# 11: 625 863
# 12: 657 380
# 13: 679 488
# 14: 707 782
# 15: 760 816
# 16: 763 404
# 17: 894 385
# 18: 906 126
# 19: 940  14
# 20: 976 107
r = c(201,380,760)
starts = DT[J(r),which=TRUE]  # binary search for items
                              # skip if the starting row numbers are known
starts
# [1]  3  8 15
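
If some of the values being searched for might not exist in column a, the lookup returns NA for them by default. Here is a minimal sketch of dropping those instead. The table and the names DT2, r2, with_na and starts2 are made up for illustration, and it assumes a data.table version that accepts nomatch=NULL (older versions use nomatch=0L):

```r
library(data.table)

# Illustrative keyed table; the values are invented for this sketch
DT2 <- data.table(a = c(10L, 20L, 30L, 40L, 50L), b = 1:5, key = "a")
r2 <- c(20L, 35L, 40L)   # 35 does not occur in column a

with_na <- DT2[J(r2), which = TRUE]                   # default: NA where there is no match
starts2 <- DT2[J(r2), which = TRUE, nomatch = NULL]   # non-matching values are dropped
```

With the default, the missing value shows up as an NA start position. With nomatch=NULL, only the positions that were found remain.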

Option 1: make the row number sequences, concatenate them, and do one lookup in DT (no need for keys or binary search just to select by row numbers):

DT[unlist(lapply(starts, seq.int, length.out=5))]
#       a   b
#  1: 201 267
#  2: 204 478
#  3: 266 935
#  4: 372 212
#  5: 374 711
#  6: 380 184
#  7: 491 659
#  8: 572 651
#  9: 625 863
# 10: 657 380
# 11: 760 816
# 12: 763 404
# 13: 894 385
# 14: 906 126
# 15: 940  14
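
A segment that starts near the end of the table would run past the last row, and row numbers beyond nrow(DT) come back as NA rows. The following is a hedged sketch of one way to build the same kind of index vector without lapply and to discard out-of-range positions. The table and the names DT3, starts3 and seg_len are assumptions for illustration:

```r
library(data.table)

# Illustrative table; the values are invented for this sketch
DT3 <- data.table(a = c(5L, 8L, 13L, 21L, 34L, 55L, 89L), b = 1:7)
starts3 <- c(2L, 5L)   # known starting rows
seg_len <- 4L          # rows per segment

# Each start repeated seg_len times, plus the recycled offsets 0..seg_len-1
idx <- rep(starts3, each = seg_len) + (seq_len(seg_len) - 1L)
idx <- idx[idx <= nrow(DT3)]   # drop positions past the last row
res3 <- DT3[idx]
```

Overlapping segments are kept as they are, so a row shared by two segments appears twice in the result.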

Option 2: make a list of data.table subsets and then rbind them together. This is less efficient than option 1, but is included for completeness:

L = lapply(starts,function(i)DT[seq.int(i,i+4)])
L
# [[1]]
#      a   b
# 1: 201 267
# 2: 204 478
# 3: 266 935
# 4: 372 212
# 5: 374 711
# 
# [[2]]
#      a   b
# 1: 380 184
# 2: 491 659
# 3: 572 651
# 4: 625 863
# 5: 657 380
# 
# [[3]]
#      a   b
# 1: 760 816
# 2: 763 404
# 3: 894 385
# 4: 906 126
# 5: 940  14

rbindlist(L)   # more efficient than do.call("rbind",L). See ?rbindlist.
#       a   b
#  1: 201 267
#  2: 204 478
#  3: 266 935
#  4: 372 212
#  5: 374 711
#  6: 380 184
#  7: 491 659
#  8: 572 651
#  9: 625 863
# 10: 657 380
# 11: 760 816
# 12: 763 404
# 13: 894 385
# 14: 906 126
# 15: 940  14
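
One possible reason to keep the list form is that each piece can be labelled before binding. As a sketch, the idcol argument of rbindlist adds a column recording which list element each row came from. The table and the names DT4, starts4, L4 and res4 are illustrative assumptions:

```r
library(data.table)

# Illustrative table; the values are invented for this sketch
DT4 <- data.table(a = c(5L, 8L, 13L, 21L, 34L, 55L), b = 1:6)
starts4 <- c(1L, 4L)

L4 <- lapply(starts4, function(i) DT4[seq.int(i, i + 2L)])
names(L4) <- starts4                     # label each piece by its starting row
res4 <- rbindlist(L4, idcol = "start")   # "start" column is filled from the list names
```

Since list names are character strings, the start column holds "1" and "4" rather than integers.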

OTHER TIPS

I think this is a better way: according to the 10 minute introduction to data.table, a join on the key uses binary search and is therefore preferable:

library(data.table)
x <- data.table(a=1:100, b=1:100, key="a")
r <- c(1,2,10,20,44)
vec <- numeric()
for (elem in r) {
  vec <- c(vec, seq(from=elem, by=1, length.out=6))
}
x[data.table(vec)]
     a  b
 1:  1  1
 2:  2  2
 3:  3  3
 4:  4  4
 5:  5  5
 6:  6  6
 7:  2  2
...

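Note that I first set column a as the key and then create an inner data.table to join on that column. Because the join matches values of a rather than row positions, it relies on a holding contiguous integers, as in this example; any value of vec not present in a would come back with NA in b. The creation of vec is probably not the best way, but that shouldn't be the bottleneck.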
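
For what it's worth, the vector of key values can also be built without growing it in a loop. This is a minimal sketch under the same assumption that a holds contiguous integers; the names x5, r5, vec5 and res5 are made up for illustration:

```r
library(data.table)

x5 <- data.table(a = 1:100, b = 1:100, key = "a")
r5 <- c(1L, 2L, 10L, 20L, 44L)

# Key values to look up: each start value followed by the next five integers
vec5 <- rep(r5, each = 6L) + 0:5
res5 <- x5[J(vec5)]   # keyed join on column a, one result row per looked-up value
```

This looks up values of a, not row numbers, so it only matches the row-position approach when the two coincide.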

Licensed under: CC-BY-SA with attribution