Date differences between specific events in R

Question 1

Here is a one-line solution using the ddply function from the plyr package, with the lubridate package to parse the dates.

Code:

library(plyr)
library(lubridate)

new_df <- ddply(.data=df, .variables=c('id'), summarize,
                days=round(ymd_hms(t[match('R',e)])-ymd_hms(t[match('A',e)]),1))
new_df

Output:

   id      days
1 086 10.9 days
2 115   NA days
3 522   NA days
4 524  2.3 days
5 638  3.2 days
6 836  1.8 days

Note that this produces 2 warnings, because ids 115 and 522 do not have a value for the e variable; their result is NA.

If you want the date difference as a plain decimal value rather than a difftime, wrap the result in the as.double function.

Basically, I am using the match function to find the first occurrence of A and of R, parsing the date variable with the ymd_hms function from the lubridate package, and then taking the difference of the two dates. I round it to 1 decimal place; the optional as.double step converts it into a plain number for display.
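
For reference, the same match-and-subtract idea can be sketched in base R on toy data. The data frame `toy` and the helper `first_gap` are made up for illustration (the original df is not shown here), and `difftime` with an explicit `units = "days"` is swapped in for the lubridate subtraction so that the unit does not depend on the size of the gap.

```r
# Hypothetical toy data: the column names id, t, e mirror the question,
# but the values are invented for illustration.
toy <- data.frame(
  id = c("001", "001", "001", "002", "002"),
  t  = c("2013-01-01 08:00:00", "2013-01-03 20:00:00", "2013-01-09 08:00:00",
         "2013-02-01 00:00:00", "2013-02-02 00:00:00"),
  e  = c("A", "R", "R", "A", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: days from the first 'A' to the first 'R' in one id.
first_gap <- function(d) {
  # match() gives the position of the first occurrence, or NA if absent
  a <- as.POSIXct(d$t[match("A", d$e)], tz = "UTC")
  r <- as.POSIXct(d$t[match("R", d$e)], tz = "UTC")
  # as.double() drops the difftime class and keeps the number of days
  round(as.double(difftime(r, a, units = "days")), 1)
}

gaps <- sapply(split(toy, toy$id), first_gap)
gaps
```

In this toy data, id 001 yields 2.5 and id 002, which has no 'R' event, yields NA.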

EDIT

After reading the OP's comments, here is a rather ugly way to get the desired result. Forgive me, it is early in the morning; it may not be elegant or efficient, but it seems to produce the desired output.

Code:

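grouper <- function(var, group) {
  # Label each row with a running group number that increases after every
  # occurrence of `group` (here 'R'), so each 'R' closes its own block.
  num <- 1
  res <- numeric(length(var))
  for(i in seq_along(var)) {   # seq_along() is safe for zero-length input
    res[i] <- num
    if(var[i]==group) {
      num <- num+1
    }
  }
  return(res)
}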

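df2 <- df
# Note: this assignment assumes df is already sorted by id, because ddply
# returns its rows grouped and ordered by id.
df2$group <- ddply(.data=df, .variables='id', summarize, group=grouper(e,'R'))$group

# Same first-'A'-to-first-'R' difference as before, now per id and group
df3 <- ddply(.data=df2, .variables=c('id','group'), summarize,
             days=round(ymd_hms(t[match('R',e)])-ymd_hms(t[match('A',e)]),1))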

df3[complete.cases(df3),-2]

Output:

    id      days
1  086 10.9 days
6  524  2.3 days
7  524  2.5 days
9  638  3.2 days
10 638  9.6 days
12 836  1.8 days
13 836  4.8 days
14 836 11.3 days
16 836  1.7 days

The idea is to add another column that groups the rows by the occurrence of an 'R' event, so that I can subset the data set by both ID and 'R' event. It is kind of hacky, and I am sure there are more elegant ways to do it.
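
The grouping step can also be written without a loop. The sketch below is one possible vectorized variant, not the code used above: `grouper_vec` is a hypothetical name, the event vectors are made up, and it assumes the event column contains no NA values. A row's group is 1 plus the number of 'R' events strictly before it, which is what a shifted cumulative sum gives.

```r
# Loop version, reproduced here only for comparison
grouper <- function(var, group) {
  num <- 1
  res <- numeric(length(var))
  for (i in seq_along(var)) {
    res[i] <- num
    if (var[i] == group) num <- num + 1
  }
  res
}

# Vectorized sketch: prepend a 0 so the count only includes earlier rows,
# then keep the first length(var) entries. Assumes no NA in var.
grouper_vec <- function(var, group) {
  hits <- var == group
  cumsum(c(0, hits))[seq_along(hits)] + 1
}

e  <- c("A", "B", "R", "A", "R", "R", "A")   # made-up event sequence
id <- c("1", "1", "1", "1", "2", "2", "2")   # made-up ids

loop_groups <- grouper(e, "R")
vec_groups  <- grouper_vec(e, "R")

# Per-id grouping with ave(), which keeps the original row order and so
# does not rely on the data being sorted by id
per_id <- ave(as.numeric(e == "R"), id,
              FUN = function(h) cumsum(c(0, h))[seq_along(h)] + 1)
per_id
```

Both functions give 1 1 1 2 2 3 4 for this sequence, and the per-id version restarts the numbering at each new id.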

Now, I'm off to get some coffee.

Question 2

No need for anything but base R. Order your data.frame, keep only the "first" appearance of each event per id, and finally use aggregate, similar to what you already do:

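# Sort by every column, in column order
df <- df[do.call(order, df), ]
# Keep only the first row for each (id, e) pair
df <- df[!duplicated(df[, c("id", "e")]), ]
tdiff <- function(x) {
  # x holds the remaining timestamps of one id; a difference is only
  # defined when exactly two events are left
  if(length(x) == 2) {
    rv <- as.numeric(difftime(strptime(x[2], format="%Y-%m-%d %H:%M:%S"),
                              strptime(x[1], format="%Y-%m-%d %H:%M:%S"),
                              units = "days"))
  } else {
    rv <- NA
  }
  rv
}

rv <- aggregate(df$t, by = list(id = df$id), tdiff)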
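
To see what the ordering and duplicated() steps do, here is a small sketch on made-up rows (the `toy` data frame is hypothetical; only the column names follow the question). It sorts explicitly with order(id, t) rather than do.call(order, df), since the latter sorts by the columns in whatever order they appear in the data frame.

```r
# Made-up rows for a single id, deliberately out of order
toy <- data.frame(
  id = c("7", "7", "7", "7"),
  t  = c("2013-03-05 00:00:00", "2013-03-01 00:00:00",
         "2013-03-09 00:00:00", "2013-03-02 12:00:00"),
  e  = c("R", "A", "R", "A"),
  stringsAsFactors = FALSE
)

# Sort by id, then time, so that "first" means "earliest"
toy <- toy[order(toy$id, toy$t), ]

# duplicated() flags every repeat of an (id, e) pair after the first one,
# so negating it keeps the earliest 'A' and the earliest 'R'
first_only <- toy[!duplicated(toy[, c("id", "e")]), ]
first_only

# Difference in days between the two rows that remain
gap <- as.numeric(difftime(
  as.POSIXct(first_only$t[first_only$e == "R"], tz = "UTC"),
  as.POSIXct(first_only$t[first_only$e == "A"], tz = "UTC"),
  units = "days"))
gap
```

Here the earliest 'A' (1 March) and the earliest 'R' (5 March) survive, giving a gap of 4 days.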

Just for the sake of closure, since you no longer need it, here is a version that works the way you want.

df <- df[do.call(order, df), ]
# First 'A' event per id
df_a <- subset(df, e == "A")
df_a <- df_a[!duplicated(df_a[, c("id", "e")]), ]
# Every 'R' event, each paired with the first 'A' time of the same id
df_r <- subset(df, e == "R")
df_r[, 'A'] <- df_a[match(df_r$id, df_a$id), 't']
df_r[, 'R_A'] <- as.numeric(difftime(strptime(df_r[, 't'], format="%Y-%m-%d %H:%M:%S"),
                           strptime(df_r[, 'A'], format="%Y-%m-%d %H:%M:%S"),
                           units = "days"))
rv <- df_r[, c('id', 'R_A')]
# An 'R' that precedes the first 'A' gives a negative difference; set it to NA
rv[!is.na(rv$R_A) & rv$R_A < 0, 'R_A'] <- NA
rv <- rv[!duplicated(rv), ]

Question 3

Here is one approach:

df <- transform(df, t=as.POSIXct(t))
sp <- split(df, df$id)
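calc_diff <- function(x) {
    # earliest 'A' and earliest 'R' for this id
    start <- min(subset(x, e=="A")$t)
    end <- min(subset(x, e=="R")$t)
    # fix the unit so every id is reported in days rather than a mix of units
    return(difftime(end, start, units="days"))
}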
sapply(sp, FUN=calc_diff)
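
One thing to watch with this kind of per-id subtraction: subtracting two POSIXct values picks the unit automatically, and sapply then returns bare numbers, so ids with short and long gaps can end up in different units without any visible sign. The sketch below illustrates this with two made-up gaps; the timestamps and variable names are hypothetical.

```r
# Two made-up gaps: one shorter than a day, one longer
short <- as.POSIXct("2013-01-01 06:00:00", tz = "UTC") -
         as.POSIXct("2013-01-01 00:00:00", tz = "UTC")
long  <- as.POSIXct("2013-01-04 00:00:00", tz = "UTC") -
         as.POSIXct("2013-01-01 00:00:00", tz = "UTC")

units(short)  # "hours"
units(long)   # "days"

# Fixing the unit makes the numbers comparable across ids
short_days <- as.numeric(short, units = "days")
long_days  <- as.numeric(long,  units = "days")
c(short_days, long_days)
```

With the unit fixed, the two gaps come out as 0.25 and 3 days, instead of the 6 and 3 that the automatic units would give.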