Question

I am looking for an implementation of union for time intervals which is capable of dealing with unions that are not themselves intervals.

I have noticed that lubridate includes a union function for time intervals, but it always returns a single interval even when the union is not an interval (i.e., it returns the interval defined by the minimum of both start dates and the maximum of both end dates, ignoring any intervening period not covered by either interval):

library(lubridate)
int1 <- new_interval(ymd("2001-01-01"), ymd("2002-01-01"))
int2 <- new_interval(ymd("2003-06-01"), ymd("2004-01-01"))
union(int1, int2)
# Union includes intervening time between intervals.
# [1] 2001-01-01 UTC--2004-01-01 UTC

I have also looked at the interval package, but its documentation makes no reference to union.

My end goal is to use the complex union with %within%:

my_int %within% Reduce(union, list_of_intervals)

So if we consider a concrete example, suppose the list_of_intervals is:

[[1]] 2000-01-01 -- 2001-01-02 
[[2]] 2001-01-01 -- 2004-01-02 
[[3]] 2005-01-01 -- 2006-01-02 

Then my_int <- 2001-01-01 -- 2004-01-01 is %within% the union of list_of_intervals (it lies inside [[2]]), so it should return TRUE, and my_int <- 2003-01-01 -- 2006-01-01 is not (it crosses the gap between 2004-01-02 and 2005-01-01), so it should return FALSE.

However, I suspect the complex union has more uses than this.


Solution

If I understand your question correctly, you'd like to start with a set of potentially overlapping intervals and obtain a list of intervals that represents the UNION of the input set, rather than just the single interval spanning the minimum and maximum of the input set. This is the same question I had.

A similar question was asked at: Union of intervals

... but the accepted response fails with overlapping intervals. However, hosolmaz (I am new to SO, so don't know how to link to this user) posted a modification (in Python) that fixes the issue, which I then converted to R as follows:

library(dplyr) # for %>%, arrange, bind_rows

interval_union <- function(input) {
  # Nothing to merge for zero or one interval.
  if (nrow(input) <= 1) {
    return(input)
  }
  # Sort by start so each row only needs comparing with the last merged row.
  input <- input %>% arrange(start)
  output <- input[1, ]
  for (i in 2:nrow(input)) {
    x <- input[i, ]
    last <- nrow(output)
    if (output$stop[last] < x$start) {
      # Gap before x: start a new merged interval.
      output <- bind_rows(output, x)
    } else if (x$stop > output$stop[last]) {
      # x overlaps or touches the last merged interval: extend it.
      output$stop[last] <- x$stop
    }
  }
  return(output)
}

With your example with overlapping and non-contiguous intervals:

d <- as.data.frame(list(
  start = c('2005-01-01', '2000-01-01', '2001-01-01'),
  stop = c('2006-01-02', '2001-01-02', '2004-01-02')),
  stringsAsFactors = FALSE)

This produces:

> d
       start       stop
1 2005-01-01 2006-01-02
2 2000-01-01 2001-01-02
3 2001-01-01 2004-01-02

> interval_union(d)
       start       stop
1 2000-01-01 2004-01-02
2 2005-01-01 2006-01-02
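
Once the intervals are merged they no longer overlap or touch, so the containment check the question is after reduces to asking whether the candidate fits inside any single merged row. A minimal sketch, assuming the columns are converted to Date; within_union is a hypothetical helper name, not part of lubridate or dplyr:

```r
# Hypothetical helper: TRUE if [start, stop] lies inside one merged interval.
# Assumes `merged` holds disjoint intervals in columns `start` and `stop`.
within_union <- function(start, stop, merged) {
  any(merged$start <= start & stop <= merged$stop)
}

# The merged result from above, with the dates as Date objects.
merged <- data.frame(
  start = as.Date(c("2000-01-01", "2005-01-01")),
  stop  = as.Date(c("2004-01-02", "2006-01-02"))
)

within_union(as.Date("2001-01-01"), as.Date("2004-01-01"), merged)
# [1] TRUE
within_union(as.Date("2003-01-01"), as.Date("2006-01-01"), merged)
# [1] FALSE
```

The second candidate fails because it crosses the uncovered gap between 2004-01-02 and 2005-01-01.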

I am a relative novice to R programming, so if anyone could convert the interval_union() function above to accept as parameters not only the input data frame, but also the names of the 'start' and 'stop' columns to use so the function could be more easily re-usable, that'd be great.
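
One possible way to do that is to pass the column names as strings and index with [[ ]]. The following is only a sketch in base R (so it does not need dplyr); the function name interval_union_cols and the parameters start_col and stop_col are made up for illustration:

```r
# Hypothetical generalisation: the start/stop column names are parameters.
interval_union_cols <- function(input, start_col = "start", stop_col = "stop") {
  # Nothing to merge for zero or one interval.
  if (nrow(input) <= 1) {
    return(input)
  }
  # Order rows by the start column.
  input <- input[order(input[[start_col]]), , drop = FALSE]
  output <- input[1, , drop = FALSE]
  for (i in 2:nrow(input)) {
    x <- input[i, , drop = FALSE]
    last <- nrow(output)
    if (output[[stop_col]][last] < x[[start_col]]) {
      # Gap before x: start a new merged interval.
      output <- rbind(output, x)
    } else if (x[[stop_col]] > output[[stop_col]][last]) {
      # x overlaps or touches the last merged interval: extend it.
      output[last, stop_col] <- x[[stop_col]]
    }
  }
  rownames(output) <- NULL
  output
}

d2 <- data.frame(
  begin = c("2005-01-01", "2000-01-01", "2001-01-01"),
  end   = c("2006-01-02", "2001-01-02", "2004-01-02"),
  stringsAsFactors = FALSE
)
interval_union_cols(d2, start_col = "begin", stop_col = "end")
```

Any columns other than the two named ones keep the values from the first row of each merged group, which may or may not be what you want.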

Other tips

Well, in the example you provided, the union of int1 and int2 could simply be represented as a vector holding the two intervals:

int1 <- new_interval(ymd("2001-01-01"), ymd("2002-01-01"))
int2 <- new_interval(ymd("2003-06-01"), ymd("2004-01-01"))
ints <- c(int1,int2)

%within% works on vectors, so you can do something like this:

my_int <- new_interval(ymd("2001-01-01"), ymd("2004-01-01"))
my_int %within% ints
# [1]  TRUE FALSE

So you can check whether your interval is in one of the intervals of your list with any:

any(my_int %within% ints)
# [1] TRUE

Your comment is right: the results given by %within% don't seem consistent with the documentation, which says:

If a is an interval, both its start and end dates must fall within b to return TRUE.

If I look at the source code of %within% when a and b are both intervals, it seems to be the following:

setMethod("%within%", signature(a = "Interval", b = "Interval"), function(a,b){
    as.numeric(a@start) - as.numeric(b@start) <= b@.Data & as.numeric(a@start) - as.numeric(b@start) >= 0
})

So it seems that only the starting point of a is tested against b, which is consistent with the results above. Maybe this should be considered a bug and reported?
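
For comparison, a check that tests both endpoints, as the documentation describes, could look roughly like this. It is a sketch only, assuming a lubridate version that provides interval(), int_start() and int_end(), and assuming every interval runs forwards in time; within_both_ends is a hypothetical helper name, not a lubridate function:

```r
library(lubridate)

# Hypothetical helper: TRUE only when both endpoints of a lie inside b.
# Assumes all intervals run forwards (start <= end).
within_both_ends <- function(a, b) {
  int_start(a) >= int_start(b) & int_end(a) <= int_end(b)
}

# The same two intervals as above, built as one vectorised Interval.
ints <- interval(ymd(c("2001-01-01", "2003-06-01")),
                 ymd(c("2002-01-01", "2004-01-01")))
my_int <- interval(ymd("2001-01-01"), ymd("2004-01-01"))

within_both_ends(my_int, ints)
# [1] FALSE FALSE
```

Here my_int starts inside the first interval but ends after it, and ends inside the second but starts before it, so neither element is TRUE.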

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow