Question

I have a dataset of 100 countries with five variables for each country. For each country, I want to run a linear regression and store the results. The main problem is that for some countries I have no data for some variables.

My data set has this structure:

set.seed(1)
Q <- as.data.frame(matrix(rnorm(360),9,40))
colnames(Q)[1]<- "Country"
colnames(Q)[2]<- "Variable"
colnames(Q)[3:40] <- paste(1900:1937)
Q[1:3,1] <- "CountryA"
Q[4:6,1] <- "CountryB"
Q[7:9,1] <- "CountryC"
Q[1:3,2] <- paste("var",1:3,sep="")
Q[4:6,2] <- paste("var",1:3,sep="")
Q[7:9,2] <- paste("var",1:3,sep="")

For each country, I want to run the regression:

lm(var1~var2+var3)

1. Example with balanced data set

My approach is as follows:

# subset the data set for each country (if someone knows an easier approach, please tell me)
datasets <- list(NA)
j <- 1
for(cat in unique(Q$Country)){
  sub <- subset(Q, Country==cat, select=c(2:40))
  sub1 <- as.data.frame(t(sub))
  colnames(sub1) <- sub[,1 ]
  sub1 <- sub1[-1, ]
  sub1$var1 <- as.numeric(as.character(sub1$var1)) 
  sub1$var2 <- as.numeric(as.character(sub1$var2))
  sub1$var3 <- as.numeric(as.character(sub1$var3))
  sub1 <- sub1[,colSums(is.na(sub1))<nrow(sub1)]
  datasets[[j]] <- sub1
  j <- j+1

}

# llply() and ldply() come from the plyr package
library(plyr)

# apply linear regression to each dataset
regressions <- llply(datasets, lm, formula = var1 ~ .)

# extract coefficients from regressions
coefs <- ldply(regressions, coef)

This works without problems:

> coefs
   (Intercept)       var2       var3
1 0.0009635977  0.1627555 -0.1738419
2 0.2571188803 -0.3548750 -0.0248167
3 0.1109881052 -0.0722544  0.1439666

2. Example with unbalanced data set

Now I make some variables missing for some countries:

# Add missing variables: 
Q[2,3:40] <- rep(NA) 
Q[6,3:40] <- rep(NA)

If I run the code from step 1 again, everything works except the last statement, coefs <- ldply(regressions, coef), which fails with an error:

[...]
> coefs <- ldply(regressions, coef)
Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) : 
  Results do not have equal lengths

My question: How can I modify the code so that it works on an unbalanced data set (where some variables are missing)?

Thanks for any help or suggestions!


Solution

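Replace the columns which are all NA with zeros. lm() then reports an NA coefficient for such a constant column (it is collinear with the intercept) instead of dropping the term, so every country yields a coefficient vector of the same length: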

Coef <- function(x) {
    DF <- setNames(as.data.frame(t(x[-(1:2)])), x$Variable)
    DF[colSums(is.na(DF)) == nrow(DF)] <- 0
    coef(lm(var1 ~., DF))
}
do.call(rbind, by(Q, Q$Country, Coef))

giving:

         (Intercept)        var2       var3
CountryA  0.01863015          NA -0.1982462
CountryB  0.26296826 -0.35416216         NA
CountryC  0.11098809 -0.07225439  0.1439667
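
If filling with zeros feels too implicit, a possible variant is to drop the all-NA columns (as the loop in the question does) and then align each coefficient vector to a fixed set of term names. The sketch below is only an illustration under some assumptions: the names coef_aligned and all_terms are hypothetical, var1 is assumed to be observed for every country, and the data are rebuilt as in the question's unbalanced example.

```r
# Rebuild the unbalanced example data from the question.
set.seed(1)
Q <- as.data.frame(matrix(rnorm(360), 9, 40))
colnames(Q)[1:2] <- c("Country", "Variable")
colnames(Q)[3:40] <- paste(1900:1937)
Q$Country  <- rep(c("CountryA", "CountryB", "CountryC"), each = 3)
Q$Variable <- rep(paste0("var", 1:3), times = 3)
Q[2, 3:40] <- NA   # CountryA: var2 missing
Q[6, 3:40] <- NA   # CountryB: var3 missing

# Hypothetical helper names; the terms every result row should contain.
all_terms <- c("(Intercept)", "var2", "var3")

coef_aligned <- function(x) {
  # Years become rows, variables become columns.
  DF <- setNames(as.data.frame(t(x[-(1:2)])), x$Variable)
  # Drop variables that have no observations at all
  # (assumes var1 itself is never entirely missing).
  DF <- DF[colSums(is.na(DF)) < nrow(DF)]
  fit <- coef(lm(var1 ~ ., data = DF))
  # Indexing by name returns NA for terms that were not in this fit.
  setNames(fit[all_terms], all_terms)
}

coefs <- do.call(rbind, by(Q, Q$Country, coef_aligned))
print(coefs)
```

The result should have the same shape as the output above, with NA where a country had no data for a predictor. The difference is that the NA comes from the explicit name alignment rather than from lm() detecting a constant column.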
Licensed under: CC-BY-SA with attribution