Speed up sapply to find whether a time difference exceeds a threshold

https://stackoverflow.com//questions/24027386

21-12-2019

Question

I need sapply to return a logical vector indicating whether a time difference exceeds a certain threshold (in my case, a number of days set by a for loop).

Sample data (dates have been converted using as.Date):

#DF called "held"
ID  Result  Start_Date
123 0   12/5/2013
123 0   12/12/2013
123 0   12/31/2013
123 0   4/22/2014
123 1   4/23/2014
654 0   9/3/2013
654 0   9/17/2013
98  0   10/18/2013
98  0   10/19/2013
98  2   12/20/2013
555 0   2/1/2014
555 0   3/2/2014
555 0   3/3/2014
66  1   1/12/2013

Code:

#empty vectors to be populated for plotting
a <- c()
b <- c()
for (n in 1:60){
#all rows where ID is not duplicated and Result is either 1 or 2 are FALSE
#all ID's where the difference between the min and max Start_Date (across multiple rows) exceeds the threshold are TRUE
  held$CHNS <-((!(!(held$ID %in% held$ID[duplicated(held$ID) | duplicated(held$ID, fromLast = TRUE)])&(held$Result %in% c(1,2)))) & (sapply(held$ID,function(x) max(held$Start_Date[held$ID == x]) - min(held$Start_Date[held$ID == x]) > n)))
#find percentage of Results 1 and 2 in entire CHNS population
  m <- length(held$Result[held$Result %in% c(1,2) & held$CHNS == TRUE])/nrow(held[held$CHNS == TRUE,])
#assign vector elements
  a[n] <- n
  b[n] <- m
}

The current code seems to be accurate, but it's extremely slow. Any tips on how to improve? Should I even be using sapply? Thank you!

Solution

This can be vectorised quite nicely, as shown below.

held <- read.table(text=
  'ID  Result  Start_Date
  123 0   12/5/2013
  123 0   12/12/2013
  123 0   12/31/2013
  123 0   4/22/2014
  123 1   4/23/2014
  654 0   9/3/2013
  654 0   9/17/2013
  98  0   10/18/2013
  98  0   10/19/2013
  98  2   12/20/2013
  555 0   2/1/2014
  555 0   3/2/2014
  555 0   3/3/2014
  66  1   1/12/2013', header=TRUE)

held$Start_Date <- as.Date(held$Start_Date, '%m/%d/%Y')

# Add a column giving the number of days spanned for the ID
held$date.diff <- with(held, {
  ndays <- tapply(Start_Date, ID, function(x) diff(range(x)))
  ndays[match(ID, names(ndays))]
})

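sapply(1:60, function(n) {
  with(held, {
    # rule1: the ID occurs only once and its Result is 1 or 2
    # (!duplicated(ID) alone would flag the first row of every ID instead)
    rule1 <- !(ID %in% ID[duplicated(ID)]) & Result %in% 1:2
    # rule2: the ID's date span exceeds the threshold n
    rule2 <- date.diff > n
    outcome <- !rule1 & rule2
    # share of Results 1 and 2 among the rows flagged TRUE
    sum(outcome & Result %in% 1:2) / sum(outcome)
  })
})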

#  [1] 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462
#  [8] 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462 0.1818182
# [15] 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182
# [22] 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182
# [29] 0.1818182 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000
# [36] 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000
# [43] 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000
# [50] 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000
# [57] 0.2500000 0.2500000 0.2500000 0.2500000
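
The per-ID span can also be attached to every row in a single step with base R's ave(), which avoids the separate match() lookup. This is only a minimal sketch of that alternative: the `toy` data frame is a shortened, hypothetical stand-in for `held`, not part of the original answer.

```r
# Shortened toy data (hypothetical), in the same shape as `held`
toy <- data.frame(
  ID = c(123, 123, 654, 654, 66),
  Result = c(0, 1, 0, 0, 1),
  Start_Date = as.Date(c("2013-12-05", "2014-04-23",
                         "2013-09-03", "2013-09-17",
                         "2013-01-12"))
)

# ave() applies the function within each ID group and recycles the
# single result back onto every row of that group
toy$date.diff <- ave(as.numeric(toy$Start_Date), toy$ID,
                     FUN = function(x) diff(range(x)))

toy$date.diff
# [1] 139 139  14  14   0
```

Converting the dates with as.numeric() first keeps the result a plain number of days, so it can be compared directly with n.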

A quick benchmark:

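library(microbenchmark)
# jbaums() and userNaN() are not defined in the post; presumably they wrap
# the vectorised code above and the question's loop, respectively
microbenchmark(jbaums(), userNaN())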

# Unit: milliseconds
#         expr        min         lq     median         uq        max neval
#     jbaums()   1.994695   2.110046   2.164258   2.223137   3.685502   100
#    userNaN() 110.448790 112.985603 114.911328 117.714080 489.052823   100

Other tips

For one thing, I would find the difference for each ID first, outside the loop. Then just do the difference check in a loop from 1:60 if you have to. I would also use dplyr to calculate the differences, which should simplify the code a lot and probably make it faster. Using your example:

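library(dplyr)

# One row per ID, holding the span between its first and last Start_Date
by_id <- group_by(held, ID)
Diff <- summarise(by_id, Difference = max(Start_Date) - min(Start_Date))

a <- 1:60
b <- numeric(60)

# For each threshold n, the share of IDs whose span exceeds n days
for (n in 1:60) {
  b[n] <- mean(Diff$Difference > n)
}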

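That should give you a vector b holding, for each threshold n, the proportion of IDs whose date span exceeds n days. Note that this is the share of IDs, not the share of rows with Result 1 or 2 that the loop in the question computes.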
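
If the loop over n is not needed for anything else, the comparison against all thresholds can be made in one call. The following is a minimal base R sketch under stated assumptions: `spans` and `thresholds` are hypothetical names, and the values in `spans` are the per-ID day counts worked out by hand from the sample data.

```r
# Per-ID spans in days for the sample data (hypothetical vector name)
spans <- c(`123` = 139, `654` = 14, `98` = 63, `555` = 30, `66` = 0)

# outer() builds an ID-by-threshold logical matrix; colMeans() then
# gives the share of IDs whose span exceeds each threshold
thresholds <- 1:60
b <- colMeans(outer(spans, thresholds, ">"))

b[c(1, 14, 30)]
# [1] 0.8 0.6 0.4
```

The matrix has one row per ID and one column per threshold, so this trades a little memory for dropping the explicit loop.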

Licensed under: CC-BY-SA with attribution