Domanda

I've spent quite some time trying to calcule date differences between specific events. For example, I have the data set below, where t is the event datetime, e is the event ('A' or 'R') and id is an identifier.

t                   e   id
2013-06-15 20:59:17 A   086
2013-06-26 18:02:09 R   086
2013-06-27 03:17:07 A   086
2013-06-01 11:34:09 R   115
2013-06-16 19:59:08 R   522
2013-06-16 20:05:04 R   522
2013-06-07 09:31:15 A   524
2013-06-09 16:24:04 R   524
2013-06-14 03:38:09 A   524
2013-06-16 15:49:09 R   524
2013-06-21 03:54:19 A   524
2013-06-12 12:34:37 A   638
2013-06-12 13:15:27 A   638
2013-06-15 16:12:23 R   638
2013-06-18 22:05:03 A   638
2013-06-28 13:30:20 R   638
2013-06-28 22:20:12 A   638
2013-06-01 18:34:46 A   836
2013-06-01 18:44:53 A   836
2013-06-03 14:35:09 R   836
2013-06-03 20:25:51 A   836
2013-06-03 20:27:25 A   836
2013-06-03 20:32:17 A   836
2013-06-08 16:22:07 R   836
2013-06-12 13:12:21 A   836
2013-06-12 13:15:16 A   836
2013-06-12 13:18:53 A   836
2013-06-12 18:59:24 A   836
2013-06-23 21:14:12 R   836
2013-06-24 20:16:11 R   836
2013-06-25 03:34:29 A   836
2013-06-26 20:33:11 R   836

Would like to make a new data set where, for each id, I would have the datetime difference (in days) between the 'A' events and the 'R' events.

For each id: if there are not events A before E then compute nothing. If there is three events A before two events E then compute the diff days with the first event A and the first event E. If there is only events A or E then compute nothing. Then, for the id=086, 115, 638 I would like to get this duples (id, days) (086, 10.9) (638, 3.2) (638, 9.6). The id=115 doesn't appears or could it be NaN beacause it has only one event R and no events A before it.

id  days 
086 10.9  (Ex. 2013-06-26 18:02:09 - 2013-06-15 20:59:17)
524 2.3
524 2.5
638 3.2
638 9.6
836 1.8
836 4.8   (Ex. 2013-06-08 16:22:07 - 2013-06-03 20:25:51)
836 11.3
836 1.7

I'm using this code as a first approach:

aggregate(as.POSIXct(df$t), list(df$id), diff)

Any ideas as to how to do this? Thanks in advance.

A dput of dataframe is:

> dput(df)
structure(list(t = c("2013-06-15 20:59:17", "2013-06-26 18:02:09", 
"2013-06-27 03:17:07", "2013-06-01 11:34:09", "2013-06-16 19:59:08", 
"2013-06-16 20:05:04", "2013-06-07 09:31:15", "2013-06-09 16:24:04", 
"2013-06-14 03:38:09", "2013-06-16 15:49:09", "2013-06-21 03:54:19", 
"2013-06-12 12:34:37", "2013-06-12 13:15:27", "2013-06-15 16:12:23", 
"2013-06-18 22:05:03", "2013-06-28 13:30:20", "2013-06-28 22:20:12", 
"2013-06-01 18:34:46", "2013-06-01 18:44:53", "2013-06-03 14:35:09", 
"2013-06-03 20:25:51", "2013-06-03 20:27:25", "2013-06-03 20:32:17", 
"2013-06-08 16:22:07", "2013-06-12 13:12:21", "2013-06-12 13:15:16", 
"2013-06-12 13:18:53", "2013-06-12 18:59:24", "2013-06-23 21:14:12", 
"2013-06-24 20:16:11", "2013-06-25 03:34:29", "2013-06-26 20:33:11"
), e = c("A", "R", "A", "R", "R", "R", "A", "R", "A", "R", "A", 
"A", "A", "R", "A", "R", "A", "A", "A", "R", "A", "A", "A", "R", 
"A", "A", "A", "A", "R", "R", "A", "R"), id = c("086", "086", 
"086", "115", "522", "522", "524", "524", "524", "524", "524", 
"638", "638", "638", "638", "638", "638", "836", "836", "836", 
"836", "836", "836", "836", "836", "836", "836", "836", "836", 
"836", "836", "836")), .Names = c("t", "e", "id"), row.names = c(855945L, 
1481100L, 1508045L, 16944L, 920490L, 921005L, 349201L, 494172L, 
746450L, 904442L, 1163757L, 653045L, 654357L, 834901L, 1047932L, 
1583218L, 1613753L, 36421L, 37178L, 139968L, 162274L, 162417L, 
162804L, 430725L, 654254L, 654350L, 654453L, 670726L, 1333676L, 
1384583L, 1401293L, 1491782L), class = "data.frame")
È stato utile?

Soluzione

Here is a solution in 1 line using the ddply function from the plyr package and the lubridate package to parse the dates.

Code:

library(plyr)
library(lubridate)

new_df <- ddply(.data=df, .variables=c('id'), summarize,
                days=round(ymd_hms(t[match('R',e)])-ymd_hms(t[match('A',e)]),1))
new_df

Output:

   id      days
1 086 10.9 days
2 115   NA days
3 522   NA days
4 524  2.3 days
5 638  3.2 days
6 836  1.8 days

Note that there are 2 warnings because the ids 115 and 522 do not have a value for the e variable.

If you want the date difference to be a decimal value, you can use the as.double function, like so:

Basically, I am using the match function to find the first occurrence of A and R, parsing the date variable with the ymd_hms function from the lubridate package, and then finding the difference of the two dates. I round it to 1 decimal place, and then convert it into a double for display.

EDIT

After reading the OPs comments, here is a rather ugly way to get the desired result. Forgive me, it is early in the morning, and it may not be elegant or efficient, but it seems to output the desired result.

Code:

grouper <- function(var, group) {
  num <- 1
  res <- c(1:length(var))
  for(i in 1:length(var)) {
    res[i] <- num
    if(var[i]==group) {
      num <- num+1
    }
  }
  return(res)
}

df2 <- df
df2$group <- ddply(.data=df, .variables='id', summarize, group=grouper(e,'R'))$group

df3 <- ddply(.data=df2, .variables=c('id','group'), summarize,
             days=round(ymd_hms(t[match('R',e)])-ymd_hms(t[match('A',e)]),1))

df3[complete.cases(df3),-2]

Output:

    id      days
1  086 10.9 days
6  524  2.3 days
7  524  2.5 days
9  638  3.2 days
10 638  9.6 days
12 836  1.8 days
13 836  4.8 days
14 836 11.3 days
16 836  1.7 days

The idea is to add another column that groups the rows by the occurrence of an 'R' event, so that I can subset the data set by both ID and 'R' event. It is kind of hacky, and I am sure there are more elegant ways to do it.

Now, I'm off to get some coffee.

Altri suggerimenti

No need for anything, but basic R. Order your data.frame, choose your "first" appearances and finally use aggregate similar to what you use:

df <- df[do.call(order, df), ]
df <- df[!duplicated(df[, c("id", "e")]), ]
tdiff <- function(x) {
  if(length(x) == 2) {
     rv <- as.numeric(difftime(strptime(x[2], format="%Y-%m-%d %H:%M:%S"),
                               strptime(x[1], format="%Y-%m-%d %H:%M:%S"),
                               units = "days"))
  } else {
     rv <- NA
  }
  rv
}

rv <- aggregate(df$t, by = list(id = df$id), tdiff)

Just for the sake of closure as you don't need it anymore, here is the version that works the way you want.

df <- df[do.call(order, df), ]
df_a <- subset(df, e == "A")
df_a <- df_a[!duplicated(df_a[, c("id", "e")]), ]
df_r <- subset(df, e == "R")
df_r[, 'A'] <- df_a[match(df_r$id, df_a$id), 't']
df_r[, 'R_A'] <- as.numeric(difftime(strptime(df_r[, 't'], format="%Y-%m-%d %H:%M:%S"),
                           strptime(df_r[, 'A'], format="%Y-%m-%d %H:%M:%S"),
                           units = "days"))
rv <- df_r[, c('id', 'R_A')]
rv[!is.na(rv$R_A) & rv$R_A < 0, 'R_A'] <- NA
rv <- rv[!duplicated(rv), ]

Here is one approach

df <- transform(df, t=as.POSIXct(t))
sp <- split(df, df$id)
calc_diff <- function(x) {
    start <- min(subset(x, e=="A")$t)
    end <- min(subset(x, e=="R")$t)
    return(end-start)
}
sapply(sp, FUN=calc_diff)
Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top