I have a dataset of 100 countries with five variables per country. For each country, I want to run a linear regression and store the results. The main problem is that for some countries, some variables have no data at all.
My data set has this structure (a small reproducible example with three countries and three variables):
set.seed(1)
Q <- as.data.frame(matrix(rnorm(360),9,40))
colnames(Q)[1]<- "Country"
colnames(Q)[2]<- "Variable"
colnames(Q)[3:40] <- paste(1900:1937)
Q[1:3,1] <- "CountryA"
Q[4:6,1] <- "CountryB"
Q[7:9,1] <- "CountryC"
Q[1:3,2] <- paste("var",1:3,sep="")
Q[4:6,2] <- paste("var",1:3,sep="")
Q[7:9,2] <- paste("var",1:3,sep="")
For each country, I want to fit the regression:
lm(var1~var2+var3)
1. Example with balanced data set
My approach is as follows:
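# subset the data set for each country (if someone knows an easier approach, please tell me)
datasets <- list()
j <- 1
for(ctry in unique(Q$Country)){   # 'ctry' avoids masking base::cat
  # keep the Variable column plus the 38 year columns for this country
  sub <- subset(Q, Country==ctry, select=c(2:40))
  # transpose so that years are rows and variables are columns
  sub1 <- as.data.frame(t(sub))
  colnames(sub1) <- sub[,1 ]
  sub1 <- sub1[-1, ]
  # t() turned everything into text, so convert back to numeric
  sub1$var1 <- as.numeric(as.character(sub1$var1))
  sub1$var2 <- as.numeric(as.character(sub1$var2))
  sub1$var3 <- as.numeric(as.character(sub1$var3))
  # drop variables that are entirely NA
  sub1 <- sub1[,colSums(is.na(sub1))<nrow(sub1)]
  datasets[[j]] <- sub1
  j <- j+1
}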
# apply linear regression to each dataset (llply and ldply come from the plyr package)
library(plyr)
regressions <- llply(datasets, lm, formula = var1 ~.)
# extract coefficients from regressions
coefs <- ldply(regressions, coef)
This works without problems:
> coefs
   (Intercept)       var2       var3
1 0.0009635977  0.1627555 -0.1738419
2 0.2571188803 -0.3548750 -0.0248167
3 0.1109881052 -0.0722544  0.1439666
2. Example with unbalanced data set
Now I make some variables entirely missing for some countries:
# Make var2 of CountryA and var3 of CountryB entirely missing:
Q[2,3:40] <- NA
Q[6,3:40] <- NA
If I run the code from step 1 again, the loop and the regressions work fine, but the last statement, coefs <- ldply(regressions, coef), fails with an error:
[...]
> coefs <- ldply(regressions, coef)
Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) :
Results do not have equal lengths
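As far as I can tell, the lengths differ because the all-NA columns are dropped inside the loop, so the models end up with different numbers of coefficients. Here is a minimal, self-contained sketch of that effect in base R. The data frames d_full and d_miss are made-up stand-ins for two of the per-country data sets, not my real data:

```r
# Illustration only: d_full and d_miss are hypothetical stand-ins for
# two per-country data frames produced by the loop in step 1.
set.seed(1)
d_full <- data.frame(var1 = rnorm(38), var2 = rnorm(38), var3 = rnorm(38))
d_miss <- d_full[, c("var1", "var3")]   # as if var2 was all NA and got dropped

# Fit var1 on all remaining columns, as in the llply() call
fits <- lapply(list(d_full, d_miss), function(d) lm(var1 ~ ., data = d))
coef_list <- lapply(fits, coef)

# Number of coefficients per model: 3 for the full data, 2 for the reduced data
sapply(coef_list, length)

# Coefficient names per model: the second model has no "var2" entry
lapply(coef_list, names)
```

So each coefficient vector is fine on its own, but they cannot be stacked row by row because they do not all have the same length.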
My question: How can I modify the code so that it also works on an unbalanced data set, where some variables are missing for some countries?
Thanks for any help or suggestions!