If I understand your question correctly, you'd like to start with a set of potentially overlapping intervals and obtain a list of intervals that represents the UNION of the input set, rather than just the single interval spanning the minimum and maximum of the input set. This is the same question I had.
A similar question was asked at "Union of intervals", but the accepted answer there fails with overlapping intervals. However, hosolmaz (I am new to SO, so I don't know how to link to this user) posted a modification (in Python) that fixes the issue, which I then converted to R as follows:
library(dplyr)  # for %>%, arrange, bind_rows

interval_union <- function(input) {
  # Nothing to merge for zero or one interval (also keeps 2:nrow(input) from counting down)
  if (nrow(input) <= 1) {
    return(input)
  }
  # Sort by start so each interval only has to be compared with the last output row
  input <- input %>% arrange(start)
  output <- input[1, ]
  for (i in 2:nrow(input)) {
    x <- input[i, ]
    last <- nrow(output)
    if (output$stop[last] < x$start) {
      # Gap before x: start a new output interval
      output <- bind_rows(output, x)
    } else if (x$stop > output$stop[last]) {
      # x overlaps or touches the last interval and extends past its end
      output$stop[last] <- x$stop
    }
  }
  return(output)
}
Using your example, which contains both overlapping and non-contiguous intervals:
d <- as.data.frame(list(
  start = c('2005-01-01', '2000-01-01', '2001-01-01'),
  stop = c('2006-01-02', '2001-01-02', '2004-01-02')),
  stringsAsFactors = FALSE)
This produces:
> d
       start       stop
1 2005-01-01 2006-01-02
2 2000-01-01 2001-01-02
3 2001-01-01 2004-01-02
> interval_union(d)
       start       stop
1 2000-01-01 2004-01-02
2 2005-01-01 2006-01-02
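Growing the result one row at a time inside a loop may become slow for large inputs. As an alternative, here is a rough vectorised sketch of the same union using base R only. This is a different technique from the loop hosolmaz posted: it sorts by start, tracks the furthest stop seen so far with cummax(), and opens a new interval wherever a start lies beyond that running maximum. The function name interval_union_vec is made up for this illustration, and the sketch assumes Date vectors of equal length with start <= stop and no missing values.

```r
# Hypothetical helper (not from the original answer): vectorised interval union.
# Assumes `start` and `stop` are Date vectors, start <= stop, no NAs.
interval_union_vec <- function(start, stop) {
  n <- length(start)
  if (n == 0) {
    return(data.frame(start = start, stop = stop))
  }
  ord <- order(start)
  s <- as.numeric(start[ord])  # days since 1970-01-01
  e <- as.numeric(stop[ord])
  run_max <- cummax(e)         # furthest stop seen so far
  # An interval opens a new group when it starts after everything seen before it
  new_group <- c(TRUE, s[-1] > run_max[-n])
  # A group ends just before the next group begins, or at the last row
  group_end <- c(new_group[-1], TRUE)
  data.frame(
    start = as.Date(s[new_group], origin = "1970-01-01"),
    stop  = as.Date(run_max[group_end], origin = "1970-01-01")
  )
}

d <- data.frame(
  start = as.Date(c("2005-01-01", "2000-01-01", "2001-01-01")),
  stop  = as.Date(c("2006-01-02", "2001-01-02", "2004-01-02"))
)
res <- interval_union_vec(d$start, d$stop)
print(res)
```

As in the loop version, intervals that merely touch (one stop equal to the next start) are merged, because a new group only opens on a strict gap.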
I am a relative novice to R programming, so if anyone could convert the interval_union() function above to accept as parameters not only the input data frame, but also the names of the 'start' and 'stop' columns to use so the function could be more easily re-usable, that'd be great.
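One possible way to do this, offered as a minimal sketch rather than a polished solution: pass the column names as strings and look the columns up with [[ ]]. The function name interval_union_cols and the arguments start_col and stop_col are hypothetical names chosen for this example. The sketch uses base R (order and rbind) in place of dplyr so that it is self-contained, and it assumes the two columns hold values that compare correctly with < and >, such as Dates, numbers, or ISO-formatted date strings.

```r
# Hypothetical generalisation: start_col and stop_col are assumed argument names.
interval_union_cols <- function(input, start_col = "start", stop_col = "stop") {
  stopifnot(is.data.frame(input),
            start_col %in% names(input),
            stop_col %in% names(input))
  if (nrow(input) <= 1) {
    return(input)
  }
  # Sort by the start column; [[ ]] selects a column by its name given as a string
  input <- input[order(input[[start_col]]), , drop = FALSE]
  output <- input[1, , drop = FALSE]
  for (i in 2:nrow(input)) {
    x <- input[i, , drop = FALSE]
    last <- nrow(output)
    if (output[[stop_col]][last] < x[[start_col]]) {
      # Gap before x: start a new output interval
      output <- rbind(output, x)
    } else if (x[[stop_col]] > output[[stop_col]][last]) {
      # x overlaps or touches the last interval and extends past its end
      output[[stop_col]][last] <- x[[stop_col]]
    }
  }
  rownames(output) <- NULL  # drop the row names left over from sorting
  output
}

d2 <- data.frame(
  from = c("2005-01-01", "2000-01-01", "2001-01-01"),
  to   = c("2006-01-02", "2001-01-02", "2004-01-02"),
  stringsAsFactors = FALSE
)
res2 <- interval_union_cols(d2, start_col = "from", stop_col = "to")
print(res2)
```

Any other columns in the data frame are carried along from the first row of each merged group, which may or may not be what you want.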