Ok, I've tried something, but I'm not sure I understand your transformation process entirely, so let me know if there are some mistakes. In general ddply
will be slow (even when .parallel = TRUE
), when there are many individuals, mainly because at the end it has to bring all the data sets of all individuals together and rbind
(or rbind.fill
) them, which takes forever for a multitude of data.frame
objects.
So here's a suggestion, where dat.orig
is your toy data set:
I would first split the task in two:
1) NUM.AFTER.DIAG == 0
2) NUM.AFTER.DIAG == 1
1) It seems that if NUM.AFTER.DIAG == 0
, except of calculating time2 and extract first row if an ID occurs more than once (like ID 333), there is not much to do in part 1):
## erase multiple occurences
dat <- dat.orig[!(duplicated(dat.orig$ID) & dat.orig$NUM.AFTER.DIAG == 0), ]
dat0 <- dat[dat$NUM.AFTER.DIAG == 0, ]
dat0$time1 <- 0
dat0$time2 <- difftime(dat0$STATUSDATE, dat0$DATE.DIAG, unit = "days")
time.na <- is.na(dat0$DOB)
dat0$time1[time.na] <- dat0$time2[time.na] <- NA
> dat0
ID STATUS STATUSDATE DATE.DIAG DOB NUM.AFTER.DIAG time1 time2
1 187 D 2000-07-15 1982-07-15 <NA> 0 NA NA days
3 265 B 2011-03-01 1982-07-15 <NA> 0 NA NA days
4 278 B 2011-03-01 1982-04-15 <NA> 0 NA NA days
5 281 B 2011-03-01 1982-10-15 1967-01-15 0 0 10364 days
7 283 D 1983-09-15 1982-05-15 1970-03-15 0 0 488 days
10 291 B 2011-03-01 1981-07-15 <NA> 0 NA NA days
11 292 B 2011-03-01 1982-01-15 1974-06-15 0 0 10637 days
13 297 D 1987-06-15 1982-01-15 1968-04-15 0 0 1977 days
14 299 D 1983-09-15 1982-04-15 1969-06-15 0 0 518 days
15 305 D 1990-09-15 1981-07-15 1977-09-15 0 0 3349 days
17 311 B 2011-03-01 1982-12-15 1975-04-15 0 0 10303 days
26 333 D 1982-10-15 1981-09-15 1967-07-15 0 0 395 days
29 334 D 1984-04-15 1982-03-15 1968-07-15 0 0 762 days
2) is a little trickier, but all you actually have to do is insert one more row and calculate the time variables:
## create subset with relevant observations
dat.unfold <- dat[dat$NUM.AFTER.DIAG != 0, ]
## compute time differences
time1 <- difftime(dat.unfold$DOB, dat.unfold$DATE.DIAG, unit = "days")
time1[time1 < 0] <- 0
time2 <- difftime(dat.unfold$STATUSDATE, dat.unfold$DATE.DIAG, unit = "days")
## calculate indices for individuals
n.obs <- daply(dat.unfold, .(ID), function(z) max(z$NUM.AFTER.DIAG) + 1)
df.new <- data.frame(ID = rep(unique(dat.unfold$ID), times = n.obs))
rle.new <- rle(df.new$ID)
ind.last <- cumsum(rle.new$lengths)
ind.first <- !duplicated(df.new$ID)
ind.first.w <- which(ind.first)
ind.second <- ind.first.w + 1
ind2.to.last <- unlist(sapply(seq_along(ind.second),
function(z) ind.second[z]:ind.last[z]))
## insert time variables
df.new$time2 <- df.new$time1 <- NA
df.new$time1[ind.first] <- 0
df.new$time1[!ind.first] <- time1
df.new$time2[!ind.first] <- time2
df.new$time2[ind2.to.last - 1] <- time1
this gives me:
> df.new
ID time1 time2
1 258 0 8401
2 258 8401 10425
3 284 0 9039
4 284 9039 10394
5 319 0 2039
6 319 2039 8827
7 319 8827 9466
8 319 9466 10333
9 322 0 1065
10 322 1065 2160
11 322 2160 3346
12 329 0 3287
13 329 3287 4657
14 329 4657 10456
This should work for the STATUS
variable and the other variables in similar fashion.
When both steps are working separately, you just have to do one rbind
step at the end.