Frage

I have a data frame (df) with missing values and want to impute interpolated values with restriction. My data frame is:

X<-c(100,NA,NA,70,NA,NA,NA,NA,NA,NA,35)
Y<-c(10,NA,NA,40,NA,NA,NA,NA,NA,NA,5)
Z<-c(50,NA,NA,20,NA,NA,NA,NA,NA,NA,90)
df<-as.data.frame(cbind(X,Y,Z))
df
     X  Y  Z
1  100 10 50
2   NA NA NA
3   NA NA NA
4   70 40 20
5   NA NA NA
6   NA NA NA
7   NA NA NA
8   NA NA NA
9   NA NA NA
10  NA NA NA
11  35  5 90

I was able to impute missing values from linear interpolation of the known values using:

 data.frame(lapply(df, function(X) approxfun(seq_along(X), X)(seq_along(X))))
     X  Y  Z
1  100 10 50
2   90 20 40
3   80 30 30
4   70 40 20
5   65 35 30
6   60 30 40
7   55 25 50
8   50 20 60
9   45 15 70
10  40 10 80
11  35  5 90

My question is how can I put constraint to the interpolation? Say NAs more than 5 consecutive entries should remain as NAs and not be imputed by linear interpolation so that my new data frame would look like:

    X  Y  Z
1  100 10 50
2   90 20 40
3   80 30 30
4   70 40 20
5   NA NA NA
6   NA NA NA
7   NA NA NA
8   NA NA NA
9   NA NA NA
10  NA NA NA
11  35  5 90
War es hilfreich?

Lösung

Here's something that works. It uses na.rm to identify NAs, rle to identify runs of NAs, and then cumsum to translate those runs into positions in the vector.

data.frame(lapply(df, function(X) {
    af = approxfun(seq_along(X), X)
    rl = rle(is.na(X))
    cu = cumsum(rl$length)
    L=5
    unlist(sapply(1:length(cu), function(x) {
        if (rl$values[x] & rl$length[x]>L) rep(NA, rl$lengths[x])
        else af(seq(cu[x]-rl$lengths[x]+1,cu[x]))
    }))
}))
Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top