Union of time intervals that are not necessarily contiguous

Question 1

If I understand your question correctly, you'd like to start with a set of potentially overlapping intervals and obtain a list of intervals that represents the UNION of the input set, rather than the single interval spanning the minimum and maximum of the input set. I had the same question.

A similar question was asked at: Union of intervals

... but the accepted answer fails with overlapping intervals. However, hosolmaz (I am new to SO, so I don't know how to link to this user) posted a modification in Python that fixes the issue, which I then converted to R as follows:

library(dplyr) # for %>%, arrange, bind_rows

interval_union <- function(input) {
  # Nothing to merge with zero or one interval.
  if (nrow(input) <= 1) {
    return(input)
  }
  # Sort by start so that overlapping intervals end up adjacent.
  input <- input %>% arrange(start)
  output <- input[1, ]
  for (i in 2:nrow(input)) {
    x <- input[i, ]
    last <- nrow(output)
    if (output$stop[last] < x$start) {
      # Gap before x: begin a new interval.
      output <- bind_rows(output, x)
    } else if (x$stop > output$stop[last]) {
      # x overlaps or touches the last interval and ends later: extend it.
      output$stop[last] <- x$stop
    }
  }
  return(output)
}

Using your example, which has both overlapping and non-contiguous intervals:

d <- as.data.frame(list(
  start = c('2005-01-01', '2000-01-01', '2001-01-01'),
  stop = c('2006-01-02', '2001-01-02', '2004-01-02')),
  stringsAsFactors = FALSE)

This produces:

> d
       start       stop
1 2005-01-01 2006-01-02
2 2000-01-01 2001-01-02
3 2001-01-01 2004-01-02

> interval_union(d)
       start       stop
1 2000-01-01 2004-01-02
2 2005-01-01 2006-01-02

I am a relative novice to R programming, so if anyone could convert the interval_union() function above to accept as parameters not only the input data frame, but also the names of the 'start' and 'stop' columns to use so the function could be more easily re-usable, that'd be great.
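One possible way to do this is sketched below. It is only a sketch, not a tested library function: the name interval_union_cols and the parameters start_col and stop_col are made up for illustration, and it uses base R (order, rbind) instead of dplyr so that the column names can be passed as plain strings. It assumes the two columns hold values that compare correctly with < and >, such as Date, numeric, or ISO-formatted character dates (not factors).

```r
# Sketch: union of intervals with configurable column names (base R only).
# `start_col` and `stop_col` are hypothetical parameter names.
interval_union_cols <- function(input, start_col = "start", stop_col = "stop") {
  # Nothing to merge with zero or one interval.
  if (nrow(input) <= 1) {
    return(input)
  }
  # Sort by the start column so that overlapping intervals end up adjacent.
  input <- input[order(input[[start_col]]), , drop = FALSE]
  output <- input[1, , drop = FALSE]
  for (i in 2:nrow(input)) {
    x <- input[i, , drop = FALSE]
    last <- nrow(output)
    if (output[[stop_col]][last] < x[[start_col]]) {
      # Gap before x: begin a new interval.
      output <- rbind(output, x)
    } else if (x[[stop_col]] > output[[stop_col]][last]) {
      # x overlaps or touches the last interval and ends later: extend it.
      output[[stop_col]][last] <- x[[stop_col]]
    }
  }
  rownames(output) <- NULL
  output
}

# Usage with differently named columns (example data from above, renamed):
d2 <- data.frame(
  from = c("2005-01-01", "2000-01-01", "2001-01-01"),
  to   = c("2006-01-02", "2001-01-02", "2004-01-02"),
  stringsAsFactors = FALSE)
interval_union_cols(d2, start_col = "from", stop_col = "to")
# two rows: 2000-01-01 to 2004-01-02, and 2005-01-01 to 2006-01-02
```

Passing the names as strings and indexing with [[ ]] avoids the non-standard evaluation that dplyr's arrange() would need for variable column names.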

Question 2

In the example you provided, the union of int1 and int2 can simply be represented as a vector containing the two intervals:

library(lubridate) # for ymd, %within% and new_interval (named interval() in recent versions)

int1 <- new_interval(ymd("2001-01-01"), ymd("2002-01-01"))
int2 <- new_interval(ymd("2003-06-01"), ymd("2004-01-01"))
ints <- c(int1,int2)

%within% works on vectors, so you can do something like this:

my_int <- new_interval(ymd("2001-01-01"), ymd("2004-01-01"))
my_int %within% ints
# [1]  TRUE FALSE

So you can check whether your interval is in at least one of the intervals in your list with any():

any(my_int %within% ints)
# [1] TRUE

Your comment is right: the results given by %within% don't seem consistent with the documentation, which says:

If a is an interval, both its start and end dates must fall within b to return TRUE.

Looking at the source code of %within% for the case where a and b are both intervals, it seems to be the following:

setMethod("%within%", signature(a = "Interval", b = "Interval"), function(a,b){
    as.numeric(a@start) - as.numeric(b@start) <= b@.Data & as.numeric(a@start) - as.numeric(b@start) >= 0
})

So it seems that only the starting point of a is tested against b, which is consistent with the results above. Maybe this should be considered a bug and reported?
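For comparison, here is a minimal sketch of a check that tests both endpoints, as the documentation describes. It is written in base R with plain Date vectors rather than lubridate Interval objects, and the helper name interval_within is hypothetical, so take it as an illustration of the intended logic and not as lubridate's implementation.

```r
# Hypothetical helper: TRUE where [a_start, a_end] lies entirely
# inside [b_start, b_end]. Vectorised over b.
interval_within <- function(a_start, a_end, b_start, b_end) {
  a_start >= b_start & a_end <= b_end
}

# Same dates as my_int and ints above.
a_start <- as.Date("2001-01-01")
a_end   <- as.Date("2004-01-01")
b_start <- as.Date(c("2001-01-01", "2003-06-01"))
b_end   <- as.Date(c("2002-01-01", "2004-01-01"))

interval_within(a_start, a_end, b_start, b_end)
# [1] FALSE FALSE
```

With both endpoints tested, my_int is not considered to be within either interval, which is what the documentation would lead you to expect.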