Question

I'm trying to find regions in a file where consecutive lines are contiguous, based on two columns, and I want the longest such span. Two lines are consecutive when the value in column 4 (V3) on one line comes immediately before the value in column 3 (V2) on the next line. The output should report the longest span of consecutive values.

The input looks like this:

> x
   grp   V1   V2  V3  V4  V5 V6 
1:   1 DOG.1 142 144 132 134  0  
2:   2 DOG.1 313 315 303 305  0  
3:   3 DOG.1 316 318 306 308  0  
4:   4 DOG.1 319 321 309 311  0 
5:   5 DOG.1 322 324 312 314  0

The output should look like this:

      out.name  in  out  
[1,] "DOG.1" "313" "324"

Notice that x[1,] was dropped and that the output starts at x[2,3] and ends at x[5,4]. All of the values in between are consecutive.

Solution

One obvious way is to take tail(x$V2, -1L) - head(x$V3, -1L) and get the start and end indices corresponding to the longest run of consecutive 1s. But I'll skip it here (and leave it to others), as I'd like to show how this can be done with the help of the IRanges package:

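require(data.table)
require(IRanges) ## Bioconductor package

## reduce() merges overlapping and directly adjacent ranges into one
x.ir = reduce(IRanges(x$V2, x$V3))
## index of the widest merged range
max.idx = which.max(width(x.ir))

## `in` is a reserved word in R, so it has to be backquoted as a column name
ans = data.table(out.name = "DOG.1", 
                 `in` = start(x.ir)[max.idx], 
                 out = end(x.ir)[max.idx])

#    out.name  in out
# 1:    DOG.1 313 324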
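
For comparison, the base-R route mentioned at the start of this answer (differencing tail(x$V2, -1L) against head(x$V3, -1L)) could be sketched roughly as follows. This is only an illustration under a few assumptions: x is rebuilt here as a plain data.frame with just the columns that matter, all rows belong to a single V1 group, and "longest" is taken to mean the run with the most rows (the IRanges version above picks the widest merged range instead). The names adjacent, run.id, best and rows are made up for this sketch.

```r
# Example data from the question; V2 is the start and V3 the end of each row.
x <- data.frame(
  grp = 1:5,
  V1  = "DOG.1",
  V2  = c(142, 313, 316, 319, 322),
  V3  = c(144, 315, 318, 321, 324),
  stringsAsFactors = FALSE
)

# TRUE where a row starts exactly one past the previous row's end.
adjacent <- tail(x$V2, -1L) - head(x$V3, -1L) == 1

# Every non-adjacent row starts a new run; the first row always starts one.
run.id <- cumsum(c(TRUE, !adjacent))

# Pick the run with the most rows (ties go to the first such run).
best <- which.max(tabulate(run.id))
rows <- which(run.id == best)

# `in` is a reserved word, so it is backquoted; check.names = FALSE
# stops data.frame() from renaming the column.
ans <- data.frame(out.name = x$V1[rows[1L]],
                  `in`     = x$V2[rows[1L]],
                  out      = x$V3[rows[length(rows)]],
                  check.names = FALSE)
ans
```

With several V1 groups, the same steps would have to be run per group, since a run must not continue across a group boundary.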
Licensed under: CC-BY-SA with attribution