Question

I have a data frame containing observations for various individuals.

The first column contains the name of the individual, and the following columns contain the observed states, with each column representing one month.

During the observation period, individuals are born, resulting in NA observations before their birth. They also leave the population for a reason recorded in their last observation, resulting in NAs after that observation. I would like to change the NAs before the first observation to a certain value, and change the NAs after an individual leaves the population to the last observation.

Since the data frame comprises more than 30,000 rows and about 400 columns, I am looking for an efficient way, other than a basic ifelse() approach.

Solution

na.locf() in the zoo package replaces NAs by carrying the last non-NA value forward. (This applies not only to trailing NAs but also to NAs in the middle of a vector; I assume you don't have those.) By default (na.rm = TRUE), it drops leading NAs, so the result is shorter than the input. You can replace those by a specified value like this:

> library(zoo)
> xx <- c(NA, NA, 1, NA, 2, 3, NA, NA)
> replacement.for.initial.NAs <- -1
> foo <- min(which(!is.na(xx)))
> c(rep(replacement.for.initial.NAs,foo-1),na.locf(xx))
[1] -1 -1  1  1  2  3  3  3

You can loop this over your individuals. There is probably a smarter way involving apply() and friends to do this process per row or column of your data structure.
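
As a rough sketch of the apply() idea, the following wraps the fill in a helper and runs it over every row. It assumes that the first column is called name and that all remaining columns are the monthly states; the helper fill_individual, the toy data frame obs, and the replacement value -1 are made up for illustration. It also uses na.locf(x, na.rm = FALSE) in place of the min(which(...)) step above, so leading NAs stay in place and can be overwritten afterwards. This is a swapped-in variant, not the only way to do it.

```r
library(zoo)

# Hypothetical helper: fill the observation vector of one individual.
fill_individual <- function(x, initial_value) {
  # na.rm = FALSE keeps leading NAs, so the vector length is preserved
  filled <- na.locf(x, na.rm = FALSE)
  # Any NA that is still present comes before the first observation
  filled[is.na(filled)] <- initial_value
  filled
}

# Toy data standing in for the real data frame (column names are assumptions)
obs <- data.frame(
  name = c("a", "b", "c"),
  m1 = c(NA, 1, NA),
  m2 = c(NA, 2, NA),
  m3 = c(1, NA, 5),
  m4 = c(2, NA, NA)
)

# Work on a matrix of the state columns; apply() returns one column per
# individual, so transpose to get back one row per individual
state_cols <- setdiff(names(obs), "name")
states <- as.matrix(obs[state_cols])
obs[state_cols] <- t(apply(states, 1, fill_individual, initial_value = -1))
```

Converting the state columns to a matrix first avoids repeated data frame indexing inside the loop, which should matter with 30,000 rows and 400 columns. As with the snippet above, NAs in the middle of a row are also carried forward, and a row with no observations at all ends up filled entirely with the replacement value.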

Licensed under: CC-BY-SA with attribution