Workflow for creating time-varying covariates in R

Question

OK, I've tried something, but I'm not sure I fully understand your transformation process, so let me know if there are any mistakes. In general, ddply will be slow when there are many individuals (even with .parallel = TRUE), mainly because at the end it has to combine the results for all individuals with rbind (or rbind.fill), which takes a very long time when there are many data.frame objects.
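To illustrate the point about the combine step, here is a minimal sketch in base R. The data set `toy` and its columns `start` and `end` are made up for this illustration and are not part of your data. It assumes the per-row computation can be written on whole columns, in which case no per-individual pieces have to be combined at all.

```r
# Hypothetical toy data: three individuals with one or more rows each
toy <- data.frame(
  ID    = c(1, 1, 2, 3, 3, 3),
  start = as.Date(c("2000-01-01", "2000-01-01", "2001-06-01",
                    "2002-03-01", "2002-03-01", "2002-03-01")),
  end   = as.Date(c("2000-01-11", "2000-02-01", "2001-06-15",
                    "2002-03-02", "2002-03-05", "2002-03-10"))
)

# Per-individual approach: one data.frame per ID, combined at the end.
# The final rbind over many small pieces is the expensive part.
pieces <- lapply(split(toy, toy$ID), function(z) {
  z$days <- as.numeric(difftime(z$end, z$start, units = "days"))
  z
})
slow <- do.call(rbind, pieces)

# Vectorized approach: one operation on whole columns, no combine step
fast <- toy
fast$days <- as.numeric(difftime(fast$end, fast$start, units = "days"))
```

Both versions compute the same `days` column; the second one avoids creating and combining one data.frame per individual, which is the idea behind the suggestion that follows.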

So here's a suggestion, where dat.orig is your toy data set:

I would first split the task in two: 1) NUM.AFTER.DIAG == 0 and 2) NUM.AFTER.DIAG != 0

1) It seems that if NUM.AFTER.DIAG == 0, there is not much to do apart from calculating time2 and extracting the first row if an ID occurs more than once (like ID 333):

## erase multiple occurrences
dat <- dat.orig[!(duplicated(dat.orig$ID) & dat.orig$NUM.AFTER.DIAG == 0), ]
dat0 <- dat[dat$NUM.AFTER.DIAG == 0, ]
dat0$time1 <- 0
dat0$time2 <- difftime(dat0$STATUSDATE, dat0$DATE.DIAG, units = "days")
## set both time variables to NA where DOB is missing
time.na <- is.na(dat0$DOB)
dat0$time1[time.na] <- dat0$time2[time.na] <- NA

> dat0
    ID STATUS STATUSDATE  DATE.DIAG        DOB NUM.AFTER.DIAG time1      time2
1  187      D 2000-07-15 1982-07-15       <NA>              0    NA    NA days
3  265      B 2011-03-01 1982-07-15       <NA>              0    NA    NA days
4  278      B 2011-03-01 1982-04-15       <NA>              0    NA    NA days
5  281      B 2011-03-01 1982-10-15 1967-01-15              0     0 10364 days
7  283      D 1983-09-15 1982-05-15 1970-03-15              0     0   488 days
10 291      B 2011-03-01 1981-07-15       <NA>              0    NA    NA days
11 292      B 2011-03-01 1982-01-15 1974-06-15              0     0 10637 days
13 297      D 1987-06-15 1982-01-15 1968-04-15              0     0  1977 days
14 299      D 1983-09-15 1982-04-15 1969-06-15              0     0   518 days
15 305      D 1990-09-15 1981-07-15 1977-09-15              0     0  3349 days
17 311      B 2011-03-01 1982-12-15 1975-04-15              0     0 10303 days
26 333      D 1982-10-15 1981-09-15 1967-07-15              0     0   395 days
29 334      D 1984-04-15 1982-03-15 1968-07-15              0     0   762 days

2) Part 2 is a little trickier, but all you actually have to do is insert one more row per individual and calculate the time variables:

## create subset with relevant observations
dat.unfold <- dat[dat$NUM.AFTER.DIAG != 0, ]
## compute time differences (in days, relative to the diagnosis date)
time1 <- difftime(dat.unfold$DOB, dat.unfold$DATE.DIAG, units = "days")
time1[time1 < 0] <- 0
time2 <- difftime(dat.unfold$STATUSDATE, dat.unfold$DATE.DIAG, units = "days")

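## calculate indices for individuals
library(plyr)  # provides daply() and .()
## note: this assumes dat.unfold is sorted by ID, so that the group order
## returned by daply matches unique(dat.unfold$ID) below
n.obs <- daply(dat.unfold, .(ID), function(z) max(z$NUM.AFTER.DIAG) + 1)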
df.new <- data.frame(ID = rep(unique(dat.unfold$ID), times = n.obs))
rle.new <- rle(df.new$ID)
ind.last <- cumsum(rle.new$lengths)
ind.first <- !duplicated(df.new$ID)
ind.first.w <- which(ind.first) 
ind.second <- ind.first.w + 1
ind2.to.last <- unlist(lapply(seq_along(ind.second), 
                function(z) ind.second[z]:ind.last[z]))

## insert time variables
df.new$time2 <- df.new$time1 <- NA
df.new$time1[ind.first] <- 0
df.new$time1[!ind.first] <- time1
df.new$time2[!ind.first] <- time2
df.new$time2[ind2.to.last - 1] <- time1

This gives me:

> df.new
    ID time1 time2
1  258     0  8401
2  258  8401 10425
3  284     0  9039
4  284  9039 10394
5  319     0  2039
6  319  2039  8827
7  319  8827  9466
8  319  9466 10333
9  322     0  1065
10 322  1065  2160
11 322  2160  3346
12 329     0  3287
13 329  3287  4657
14 329  4657 10456
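The index logic may be easier to follow on a tiny example. The sketch below is a simplified variation, not the exact code from part 2): the vectors `ids`, `time1` and `time2` are hypothetical, it uses rle() instead of daply (so it assumes the rows of each ID are contiguous), and it marks the last row per individual with duplicated(..., fromLast = TRUE) instead of building ind2.to.last.

```r
# Hypothetical input: ID 10 has one event after diagnosis, ID 20 has two.
# time1 = days from diagnosis to each event
# time2 = days from diagnosis to the status date
ids   <- c(10, 20, 20)
time1 <- c(100, 50, 200)
time2 <- c(400, 900, 900)

# Each individual needs one row more than it has events
n.obs  <- rle(ids)$lengths + 1
new.id <- rep(unique(ids), times = n.obs)

# Logical markers for the first and the last row of each individual
first <- !duplicated(new.id)
last  <- !duplicated(new.id, fromLast = TRUE)

out <- data.frame(ID = new.id, time1 = NA_real_, time2 = NA_real_)
out$time1[first]  <- 0       # every individual starts at diagnosis
out$time1[!first] <- time1   # later intervals start at an event
out$time2[!last]  <- time1   # an interval ends where the next one starts
out$time2[last]   <- time2[!duplicated(ids, fromLast = TRUE)]  # status date
```

Each individual ends up with consecutive intervals that start at 0 and end at the status date, which is the same shape as the df.new output above.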

This should work for the STATUS variable and the other variables in a similar fashion. Once both steps work separately, you just have to do one rbind step at the end.
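A minimal sketch of that final step, assuming both parts have been reduced to the same columns. The data frames `part0` and `part1` are hypothetical stand-ins holding a few rows copied from the dat0 and df.new output above. With the real objects you would probably want to convert the difftime column first (e.g. as.numeric(dat0$time2)) so that the column types match.

```r
# Hypothetical stand-ins for the two partial results
part0 <- data.frame(ID = c(281, 283), time1 = c(0, 0),
                    time2 = c(10364, 488))
part1 <- data.frame(ID = c(258, 258), time1 = c(0, 8401),
                    time2 = c(8401, 10425))

# rbind on data frames needs the same column names in both parts
stopifnot(identical(names(part0), names(part1)))

# One single combine step, then sort by individual and interval start
combined <- rbind(part0, part1)
combined <- combined[order(combined$ID, combined$time1), ]
rownames(combined) <- NULL
```

Because there is only one rbind call on two large data frames, this avoids the cost of combining one small data.frame per individual.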