Question

I have a huge data file in long format; part of it is supplied below. Each ID can have several rows, where STATUS is the final status. However, I need to do the analysis with time-varying covariates, so I need to create two new time variables and update the status variable. I've been struggling with this for some time and cannot figure out how to do it efficiently, as there can be as many as four rows per ID.

The time-varying variable is NUM.AFTER.DIAG. If NUM.AFTER.DIAG==0 it is easy: time1=0 and time2=STATUSDATE-DATE.DIAG. However, when NUM.AFTER.DIAG==1 I need to make a new row where time1=0, time2=DOB-DATE.DIAG and NUM.AFTER.DIAG=0, and also make sure STATUS="B". The second row would then have time1 equal to time2 from the previous row and time2=STATUSDATE-DATE.DIAG-time1 from this row. Likewise, if there are more rows, the different rows need to be subtracted from each other. Also, if NUM.AFTER.DIAG==0 but there are multiple rows, all extra rows can be deleted.

Any ideas for an efficient solution to this? I've looked at John Fox's unfold command, but it assumes that all the intervals are in wide format to begin with.

Edit: The table as requested. As for the censor variable: "D"=event (death)

 structure(list(ID = c(187L, 258L, 265L, 278L, 281L, 281L, 283L, 
    283L, 284L, 291L, 292L, 292L, 297L, 299L, 305L, 305L, 311L, 311L, 
    319L, 319L, 319L, 322L, 322L, 329L, 329L, 333L, 333L, 333L, 334L, 
    334L), STATUS = c("D", "B", "B", "B", "B", "B", "D", "D", "B", 
    "B", "B", "B", "D", "D", "D", "D", "B", "B", "B", "B", "B", "D", 
    "D", "B", "B", "D", "D", "D", "D", "D"), STATUSDATE = structure(c(11153, 
    15034, 15034, 15034, 15034, 15034, 5005, 5005, 15034, 15034, 
    15034, 15034, 6374, 5005, 7562, 7562, 15034, 15034, 15034, 15034, 
    15034, 7743, 7743, 15034, 15034, 4670, 4670, 4670, 5218, 5218
    ), class = "Date"), DATE.DIAG = structure(c(4578, 4609, 4578, 
    4487, 4670, 4670, 4517, 4517, 4640, 4213, 4397, 4397, 4397, 4487, 
    4213, 4213, 4731, 4731, 4701, 4701, 4701, 4397, 4397, 4578, 4578, 
    4275, 4275, 4275, 4456, 4456), class = "Date"), DOB = structure(c(NA, 
    13010, NA, NA, -1082, -626, 73, 1353, 13679, NA, 1626, 3087, 
    -626, -200, 2814, 3757, 1930, 3787, 6740, 13528, 14167, 5462, 
    6557, 7865, 9235, -901, -504, -108, -535, -78), class = "Date"), 
        NUM.AFTER.DIAG = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 
        0, 0, 0, 0, 0, 1, 2, 3, 1, 2, 1, 2, 0, 0, 0, 0, 0)), .Names = c("ID", 
    "STATUS", "STATUSDATE", "DATE.DIAG", "DOB", "NUM.AFTER.DIAG"), row.names = c(NA, 
    30L), class = "data.frame")

EDIT: I did come up with a solution, although probably not very efficient.

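library(plyr)

u1 <- ddply(p, .(ID), function(x) {
  if (all(x$NUM.AFTER.DIAG == 0)) {
    # no birth after diagnosis: keep one row covering the whole follow-up
    x <- x[1, ]
    x$time1 <- 0
    x$time2 <- as.numeric(x$STATUSDATE - x$DATE.DIAG)
  } else {
    # duplicate the first row so there is one row per interval
    x <- rbind(x, x[1, ])
    x <- x[order(x$DOB), ]
    u <- max(x$NUM.AFTER.DIAG)
    x$NUM.AFTER.DIAG <- 0:u
    # create the time columns before filling them by index
    x$time1 <- NA_real_
    x$time2 <- NA_real_
    x$time1[1] <- 0
    x$time2[1:u] <- as.numeric(x$DOB[2:(u + 1)] - x$DATE.DIAG[2:(u + 1)])
    x$time2[u + 1] <- as.numeric(x$STATUSDATE[u] - x$DATE.DIAG[u])
    x$time1[2:(u + 1)] <- x$time2[1:u]
    # all intervals before the last one end without an event
    x$STATUS[1:u] <- "B"
  }
  x
})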

Solution

OK, I've tried something, but I'm not sure I understand your transformation process entirely, so let me know if there are any mistakes. In general, ddply will be slow when there are many individuals (even with .parallel = TRUE), mainly because at the end it has to bring the data sets of all individuals together and rbind (or rbind.fill) them, which takes a long time for a large number of data.frame objects.
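To illustrate the point about avoiding one data.frame per individual, here is a minimal sketch on hypothetical toy data (not the data set from the question). It assumes that the quantities needed can be computed on whole columns, using only base R's date subtraction and ave():

```r
# Hypothetical toy data (not the data set from the question)
toy <- data.frame(
  ID = c(1L, 2L, 2L, 3L),
  STATUSDATE = as.Date(c("2011-03-01", "2011-03-01", "2011-03-01", "1990-09-15")),
  DATE.DIAG = as.Date(c("1982-07-15", "1982-10-15", "1982-10-15", "1981-07-15")),
  NUM.AFTER.DIAG = c(0, 1, 2, 0)
)

# One subtraction for all rows at once, in days
toy$time2 <- as.numeric(toy$STATUSDATE - toy$DATE.DIAG)

# Per-ID maximum, returned in the original row order
toy$max.num <- ave(toy$NUM.AFTER.DIAG, toy$ID, FUN = max)
```

Nothing here builds or binds per-ID data frames, which is the step that tends to dominate the run time of ddply with many groups.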

So here's a suggestion, where dat.orig is your toy data set:

I would first split the task in two: 1) NUM.AFTER.DIAG == 0 and 2) NUM.AFTER.DIAG != 0.

1) If NUM.AFTER.DIAG == 0, there is not much to do apart from calculating time2 and keeping only the first row when an ID occurs more than once (like ID 333):

## erase multiple occurrences
dat <- dat.orig[!(duplicated(dat.orig$ID) & dat.orig$NUM.AFTER.DIAG == 0), ]
dat0 <- dat[dat$NUM.AFTER.DIAG == 0, ]
dat0$time1 <- 0
dat0$time2 <- difftime(dat0$STATUSDATE, dat0$DATE.DIAG, units = "days")
## set both time variables to NA where DOB is missing
time.na <- is.na(dat0$DOB)
dat0$time1[time.na] <- dat0$time2[time.na] <- NA

> dat0
    ID STATUS STATUSDATE  DATE.DIAG        DOB NUM.AFTER.DIAG time1      time2
1  187      D 2000-07-15 1982-07-15       <NA>              0    NA    NA days
3  265      B 2011-03-01 1982-07-15       <NA>              0    NA    NA days
4  278      B 2011-03-01 1982-04-15       <NA>              0    NA    NA days
5  281      B 2011-03-01 1982-10-15 1967-01-15              0     0 10364 days
7  283      D 1983-09-15 1982-05-15 1970-03-15              0     0   488 days
10 291      B 2011-03-01 1981-07-15       <NA>              0    NA    NA days
11 292      B 2011-03-01 1982-01-15 1974-06-15              0     0 10637 days
13 297      D 1987-06-15 1982-01-15 1968-04-15              0     0  1977 days
14 299      D 1983-09-15 1982-04-15 1969-06-15              0     0   518 days
15 305      D 1990-09-15 1981-07-15 1977-09-15              0     0  3349 days
17 311      B 2011-03-01 1982-12-15 1975-04-15              0     0 10303 days
26 333      D 1982-10-15 1981-09-15 1967-07-15              0     0   395 days
29 334      D 1984-04-15 1982-03-15 1968-07-15              0     0   762 days
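As a small check of the filter used in part 1, the following sketch applies the same condition to hypothetical toy data. It assumes, as in the sample above, that rows with NUM.AFTER.DIAG == 0 only occur for IDs whose rows are all zero:

```r
# Hypothetical toy data: ID 1 has three rows without a birth after diagnosis,
# ID 2 has one row per birth after diagnosis
toy <- data.frame(ID = c(1L, 1L, 1L, 2L, 2L),
                  NUM.AFTER.DIAG = c(0, 0, 0, 1, 2))

# A row is dropped only if it repeats an ID AND has NUM.AFTER.DIAG == 0
drop <- duplicated(toy$ID) & toy$NUM.AFTER.DIAG == 0
kept <- toy[!drop, ]
```

The repeated rows of ID 1 are removed, while the second row of ID 2 survives because its NUM.AFTER.DIAG is not zero.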

2) is a little trickier, but all you actually have to do is insert one more row and calculate the time variables:

library(plyr)  # for daply

## create subset with relevant observations (assumed to be sorted by ID)
dat.unfold <- dat[dat$NUM.AFTER.DIAG != 0, ]
## compute time differences
time1 <- difftime(dat.unfold$DOB, dat.unfold$DATE.DIAG, units = "days")
time1[time1 < 0] <- 0
time2 <- difftime(dat.unfold$STATUSDATE, dat.unfold$DATE.DIAG, units = "days")

## calculate indices for individuals
n.obs <- daply(dat.unfold, .(ID), function(z) max(z$NUM.AFTER.DIAG) + 1)
df.new <- data.frame(ID = rep(unique(dat.unfold$ID), times = n.obs))
rle.new <- rle(df.new$ID)
ind.last <- cumsum(rle.new$lengths)
ind.first <- !duplicated(df.new$ID)
ind.first.w <- which(ind.first)
ind.second <- ind.first.w + 1
## lapply always returns a list, so unlist() gives a plain index vector
ind2.to.last <- unlist(lapply(seq_along(ind.second),
                function(z) ind.second[z]:ind.last[z]))

## insert time variables
df.new$time2 <- df.new$time1 <- NA
df.new$time1[ind.first] <- 0
df.new$time1[!ind.first] <- time1
df.new$time2[!ind.first] <- time2
df.new$time2[ind2.to.last - 1] <- time1

This gives me:

> df.new
    ID time1 time2
1  258     0  8401
2  258  8401 10425
3  284     0  9039
4  284  9039 10394
5  319     0  2039
6  319  2039  8827
7  319  8827  9466
8  319  9466 10333
9  322     0  1065
10 322  1065  2160
11 322  2160  3346
12 329     0  3287
13 329  3287  4657
14 329  4657 10456

This should work for the STATUS variable and the other variables in similar fashion. When both steps are working separately, you just have to do one rbind step at the end.
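The last two steps (filling in STATUS and the final rbind) could be sketched as follows. This is a base-R variant, not the exact indexing used above. It assumes the rows are sorted by ID and NUM.AFTER.DIAG, uses a toy subset of the question's data (IDs 258, 281 and 322), and follows the question's rule that every interval except the last one of an ID gets STATUS "B". The helper d() and the names birth, end, first, last, before, after and res are introduced only for this sketch:

```r
# Toy subset of the question's data: IDs 258, 281 and 322
d <- function(x) as.Date(x, origin = "1970-01-01")  # helper for this sketch only
dat.orig <- data.frame(
  ID = c(258L, 281L, 281L, 322L, 322L),
  STATUS = c("B", "B", "B", "D", "D"),
  STATUSDATE = d(c(15034, 15034, 15034, 7743, 7743)),
  DATE.DIAG = d(c(4609, 4670, 4670, 4397, 4397)),
  DOB = d(c(13010, -1082, -626, 5462, 6557)),
  NUM.AFTER.DIAG = c(1, 0, 0, 1, 2),
  stringsAsFactors = FALSE
)

# Part 1: IDs without a birth after diagnosis, one row each
dat <- dat.orig[!(duplicated(dat.orig$ID) & dat.orig$NUM.AFTER.DIAG == 0), ]
dat0 <- dat[dat$NUM.AFTER.DIAG == 0, ]
dat0$time1 <- 0
dat0$time2 <- as.numeric(dat0$STATUSDATE - dat0$DATE.DIAG)

# Part 2: IDs with births after diagnosis (sorted by ID and NUM.AFTER.DIAG)
dat.unfold <- dat[dat$NUM.AFTER.DIAG != 0, ]
birth <- pmax(as.numeric(dat.unfold$DOB - dat.unfold$DATE.DIAG), 0)
end <- as.numeric(dat.unfold$STATUSDATE - dat.unfold$DATE.DIAG)
first <- !duplicated(dat.unfold$ID)
last <- !duplicated(dat.unfold$ID, fromLast = TRUE)

# Interval before the first birth: one extra row per ID, never an event
before <- dat.unfold[first, ]
before$NUM.AFTER.DIAG <- 0
before$STATUS <- "B"
before$time1 <- 0
before$time2 <- birth[first]

# Intervals starting at each birth: they end at the next birth of the same ID,
# or at STATUSDATE for the last row, which keeps the final STATUS
after <- dat.unfold
after$time1 <- birth
after$time2 <- ifelse(last, end, c(birth[-1], NA))
after$STATUS <- ifelse(last, dat.unfold$STATUS, "B")

# One rbind at the end, then restore the order by ID and start time
res <- rbind(dat0, before, after)
res <- res[order(res$ID, res$time1), ]
rownames(res) <- NULL
res
```

The columns time1, time2 and STATUS are then in the (start, stop, event) layout that, for example, Surv(time1, time2, STATUS == "D") from the survival package accepts, if that is the intended analysis.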

Licensed under: CC-BY-SA with attribution