Deleting rows for which there are too many NA values in a plm data frame

https://stackoverflow.com/questions/19075891

29-06-2022

Question

I am working with a rather large panel of data on 180 countries from 1950 to 2003, using the plm package in R. One thing I need to do is remove countries for which there are too few GDP observations or, in other words, too many NAs. Here is a dummy example of what I am trying to do:

## generate dummy data
library(plm)
c1 <- rep(NA,20)
c2 <- rep(c(1,NA),10)
c3 <- c(1:15,NA,NA,NA,NA,NA)
c4 <- c(NA,1:19)
c5 <- c(1:20)
country <- c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20))
year <- rep(1:20,5)
df <- data.frame(year, country, gdp=c(c1,c2,c3,c4,c5))
pd <- pdata.frame(df,index=c("country","year"))

I then generated a vector that counts how many GDP observations there are for each country, as follows:

gdp.observations <- apply(as.matrix(pd$gdp),1,
                          function(x) length(is.na(x)[is.na(x)==FALSE]))

This produces the vector:

> gdp.observations
 A  B  C  D  E 
 0 10 15 19 20
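For comparison, the same per-country counts can be sketched on the plain data frame with base R's tapply, without going through the matrix form of the pdata.frame column. This is only an illustration: it rebuilds the dummy data so that it runs on its own, and the name gdp.obs is made up for this example.

```r
# Rebuild the dummy data from above as a plain data frame.
df <- data.frame(
  year    = rep(1:20, 5),
  country = rep(1:5, each = 20),
  gdp     = c(rep(NA, 20), rep(c(1, NA), 10), 1:15, rep(NA, 5), NA, 1:19, 1:20)
)

# Count the non-missing gdp values within each country.
# gdp.obs is a hypothetical name used only in this sketch.
gdp.obs <- tapply(!is.na(df$gdp), df$country, sum)
gdp.obs
```

Here the result is named by the country codes 1 to 5 and holds the counts 0, 10, 15, 19 and 20.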

What I would like to do now is use this vector to make a pdata.frame that includes only the countries for which gdp.observations is above a certain threshold, say 15. Is there a nice way to do this?

Answer

I suggest using ave to count the number of non-NA observations per country, and then keeping only the rows of countries with enough observations:

n <- ave(pd$gdp, pd$country, FUN=function(x)sum(!is.na(x)))

pd2 <- pd[n > 15, ]
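If subsetting the pdata.frame directly misbehaves in a given plm version, one possible variant is to filter the plain data frame first and build the pdata.frame afterwards. The following is a minimal sketch under that assumption. It rebuilds the dummy data so that it is self-contained, and the names min.obs and df2 are made up for this example.

```r
library(plm)

# Dummy data from the question, as a plain data frame.
df <- data.frame(
  year    = rep(1:20, 5),
  country = rep(1:5, each = 20),
  gdp     = c(rep(NA, 20), rep(c(1, NA), 10), 1:15, rep(NA, 5), NA, 1:19, 1:20)
)

# For every row, the number of non-missing gdp values in that row's country.
n <- ave(df$gdp, df$country, FUN = function(x) sum(!is.na(x)))

# min.obs is a hypothetical threshold name; rows are kept only if their
# country has strictly more than min.obs observations, as in the answer.
min.obs <- 15
df2 <- df[n > min.obs, ]

# Build the panel data frame from the filtered rows.
pd2 <- pdata.frame(df2, index = c("country", "year"))
```

With the dummy data, only countries 4 and 5 (19 and 20 observations) remain. Using n >= min.obs instead would also keep country 3, which has exactly 15.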

Licensed under: CC-BY-SA with attribution
