I have a data frame (df) with missing values and want to impute interpolated values with restriction. My data frame is:

X<-c(100,NA,NA,70,NA,NA,NA,NA,NA,NA,35)
Y<-c(10,NA,NA,40,NA,NA,NA,NA,NA,NA,5)
Z<-c(50,NA,NA,20,NA,NA,NA,NA,NA,NA,90)
df<-as.data.frame(cbind(X,Y,Z))
df
     X  Y  Z
1  100 10 50
2   NA NA NA
3   NA NA NA
4   70 40 20
5   NA NA NA
6   NA NA NA
7   NA NA NA
8   NA NA NA
9   NA NA NA
10  NA NA NA
11  35  5 90

I was able to impute missing values from linear interpolation of the known values using:

 data.frame(lapply(df, function(X) approxfun(seq_along(X), X)(seq_along(X))))
     X  Y  Z
1  100 10 50
2   90 20 40
3   80 30 30
4   70 40 20
5   65 35 30
6   60 30 40
7   55 25 50
8   50 20 60
9   45 15 70
10  40 10 80
11  35  5 90

My question is how can I put constraint to the interpolation? Say NAs more than 5 consecutive entries should remain as NAs and not be imputed by linear interpolation so that my new data frame would look like:

    X  Y  Z
1  100 10 50
2   90 20 40
3   80 30 30
4   70 40 20
5   NA NA NA
6   NA NA NA
7   NA NA NA
8   NA NA NA
9   NA NA NA
10  NA NA NA
11  35  5 90
有帮助吗?

解决方案

Here's something that works. It uses na.rm to identify NAs, rle to identify runs of NAs, and then cumsum to translate those runs into positions in the vector.

data.frame(lapply(df, function(X) {
    af = approxfun(seq_along(X), X)
    rl = rle(is.na(X))
    cu = cumsum(rl$length)
    L=5
    unlist(sapply(1:length(cu), function(x) {
        if (rl$values[x] & rl$length[x]>L) rep(NA, rl$lengths[x])
        else af(seq(cu[x]-rl$lengths[x]+1,cu[x]))
    }))
}))
许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top