Question

I have a well-balanced panel data set that contains NA observations. I will be using LOCF, and would like to know how many consecutive NAs are in each panel before carrying observations forward. LOCF is a procedure whereby missing values are "filled in" using the "last observation carried forward". This can make sense in some time-series applications: if we have weather data in 5-minute increments, a good guess at the value of a missing observation might be the observation made 5 minutes earlier.

Obviously, it makes more sense to carry an observation forward one hour within one panel than it does to carry that same observation forward to the next year in the same panel.
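
To make the gap-length concern concrete, here is a minimal base R sketch of LOCF with a cap on how far a value may be carried. The helper name locf_limited and its maxgap parameter are made up for illustration; the behaviour is only loosely modeled on the idea of a maximum gap and is not the zoo implementation.

```r
# Hypothetical helper (not part of zoo or data.table): carry the last
# observation forward, but only into NA runs of length <= maxgap.
# Longer runs, and leading NAs with nothing before them, stay NA.
locf_limited <- function(x, maxgap = Inf) {
  r <- rle(is.na(x))                                   # runs of NA / non-NA
  short_run <- rep(r$lengths <= maxgap, r$lengths)     # per element: is its run short enough?
  # position of the most recent non-NA value at or before each element (0 if none)
  last_obs <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  fill <- is.na(x) & short_run & last_obs > 0
  x[fill] <- x[last_obs[fill]]
  x
}

x <- c(1, NA, 3, NA, NA, NA, 7, NA)
locf_limited(x)              # every gap is filled
locf_limited(x, maxgap = 1)  # the run of three NAs is left alone
```

With maxgap = 1, the single-NA gaps are filled and the three-NA run is retained, which is why knowing the longest run per panel is useful before picking a cutoff.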

I am aware that you can set a "maxgap" argument in zoo::na.locf; however, I want to get a better feel for my data first. Here is a simple example:

require(data.table)
set.seed(12345)

### Create a "panel" data set
data <- data.table(id = rep(1:10, each = 10),
                   date = seq(as.POSIXct('2012-01-01'),
                              as.POSIXct('2012-01-10'),
                              by = '1 day'),
                   x  = runif(100))

### Randomly assign NA's to our "x" variable
na <- sample(1:100, size = 52)
data[na, x := NA]

### Calculate the max number of consecutive NA's by group...this is what I want:
### ID       Consecutive NA's
  #  1       1
  #  2       3
  #  3       3
  #  4       3
  #  5       4
  #  6       5
  #  ...
  #  10      2

### Count the total number of NA's by group...this is as far as I get:
data[is.na(x), .N, by = id]

All solutions are welcome, but data.table solutions are highly preferred because the data file is large.

Solution

This will do it:

data[, max(with(rle(is.na(x)), lengths[values])), by = id]

I just ran rle to find all consecutive NA's and picked the max length.
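
To see what the one-liner is doing, it may help to look at the rle pieces on a small vector. The wrapper max_na_run below is a hypothetical name, not something from data.table; it adds a floor of 0 so that a group with no NAs does not hit max() on an empty selection (which returns -Inf with a warning).

```r
x <- c(0.7, NA, NA, 0.2, NA, NA, NA, 0.9)

r <- rle(is.na(x))     # run-length encode the TRUE/FALSE "is missing" flags
r$lengths              # length of each run
r$values               # TRUE where the run is a run of NAs
r$lengths[r$values]    # lengths of the NA runs only

# Hypothetical helper: longest run of NAs, 0 when there are none.
max_na_run <- function(x) {
  r <- rle(is.na(x))
  max(0L, r$lengths[r$values])
}

max_na_run(x)          # longest NA run in x
max_na_run(c(1, 2))    # no NAs at all
```

Assuming the helper is defined, it could be dropped into the grouped call as data[, max_na_run(x), by = id].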


Here's a rather convoluted answer to the comment question of recovering the date ranges for the above max:

data[, {
         r = rle(is.na(x))
         na_len = ifelse(r$values, r$lengths, 0L)  # run lengths, with non-NA runs zeroed out
         n = which.max(na_len)                     # index in rle of the longest NA run

         end   = sum(r$lengths[seq_len(n)])        # last row of that run
         start = end - r$lengths[n] + 1L           # first row of that run

         # max_na is 0 for a group with no NA's (the dates are then not meaningful)
         list(start_date = date[start], end_date = date[end], max_na = na_len[n])
       }, by = id]
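
If the goal is a better feel for the data, listing every NA run rather than only the longest may also be useful. The following is a minimal sketch in base R; na_runs is a hypothetical helper name, and the start/end values are row positions within the vector passed in.

```r
# Hypothetical helper: one row per run of NAs, with its first and last
# position in x and its length. Returns zero rows if x has no NAs.
na_runs <- function(x) {
  r <- rle(is.na(x))
  end <- cumsum(r$lengths)          # last position of every run
  start <- end - r$lengths + 1L     # first position of every run
  data.frame(start  = start[r$values],
             end    = end[r$values],
             length = r$lengths[r$values])
}

x <- c(0.7, NA, NA, 0.2, NA, NA, NA, 0.9)
runs <- na_runs(x)
runs
```

Inside a grouped data.table call, the positions could then be used to index the date column (for example date[runs$start] and date[runs$end]) to get the date range of each gap per id.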

OTHER TIPS

You can use rle with the modification suggested here (and pasted below), which treats consecutive NA values as a single run so that they can be counted.

foo  <- data[, rle(x), by=id]
foo[is.na(values), max(lengths), by=id]

#     id V1
# 1:  1  1
# 2:  2  3
# 3:  3  3
# 4:  4  3
# 5:  5  4
# 6:  6  5
# 7:  7  3
# 8:  8  5
# 9:  9  2
# 10: 10  2
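
For context on why an amended rle is involved at all: the stock base::rle does not merge neighbouring NAs, because NA != NA evaluates to NA and each missing value ends up as its own run. A small check (calling base::rle explicitly, in case rle has been redefined in the session):

```r
x <- c(1, NA, NA, 2)

base_runs <- base::rle(x)
base_runs$lengths             # each NA is its own run of length 1

# Encoding the missingness flags instead groups the NAs together.
flag_runs <- base::rle(is.na(x))
flag_runs$lengths             # non-NA, two NAs, non-NA
flag_runs$values
```

So either rle has to be changed to treat NAs as equal to each other, or it has to be applied to is.na(x) rather than to x itself.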

Amended rle function:

rle<-function (x)
{
     if (!is.vector(x)&&  !is.list(x))
         stop("'x' must be an atomic vector")
     n<- length(x)
     if (n == 0L)
         return(structure(list(lengths = integer(), values = x),
             class = "rle"))

     #### BEGIN NEW SECTION PART 1 ####
     naRepFlag<-F
     IS_LOGIC<-typeof(x)=="logical"   # set up front so PART 2 also works when x has no NA's
     if(any(is.na(x))){
         naRepFlag<-T

         if(typeof(x)=="logical"){
             x<-as.integer(x)
             naMaskVal<-2
         }else if(typeof(x)=="character"){
             naMaskVal<-paste(sample(c(letters,LETTERS,0:9),32,replace=T),collapse="")
         }else{
             naMaskVal<-max(0,abs(x[!is.infinite(x)]),na.rm=T)+1
         }

         x[which(is.na(x))]<-naMaskVal
     }
     #### END NEW SECTION PART 1 ####

     y<- x[-1L] != x[-n]
     i<- c(which(y), n)

     #### BEGIN NEW SECTION PART 2 ####
     if(naRepFlag)
         x[which(x==naMaskVal)]<-NA

     if(IS_LOGIC)
         x<-as.logical(x)
     #### END NEW SECTION PART 2 ####

     structure(list(lengths = diff(c(0L, i)), values = x[i]),
         class = "rle")
}
Licensed under: CC-BY-SA with attribution