Looping regressions on an unbalanced data set in R (using apply functions)

https://stackoverflow.com/questions/23428086

14-07-2023

Question

I have a dataset of 100 countries with five variables per country. For each country, I want to fit a linear regression and store the results. The main problem is that for some countries I have no data for some variables.

My data set has this structure:

set.seed(1)
Q <- as.data.frame(matrix(rnorm(360),9,40))
colnames(Q)[1]<- "Country"
colnames(Q)[2]<- "Variable"
colnames(Q)[3:40] <- paste(1900:1937)
Q[1:3,1] <- "CountryA"
Q[4:6,1] <- "CountryB"
Q[7:9,1] <- "CountryC"
Q[1:3,2] <- paste("var",1:3,sep="")
Q[4:6,2] <- paste("var",1:3,sep="")
Q[7:9,2] <- paste("var",1:3,sep="")

For each country, I want to run the regression:

lm(var1~var2+var3)

1. Example with balanced data set

My approach is as follows:

# subset the data set for each country (if someone knows an easier approach, please tell me)
datasets <- list(NA)
j <- 1
for(cat in unique(Q$Country)){
  sub <- subset(Q, Country==cat, select=c(2:40))
  sub1 <- as.data.frame(t(sub))
  colnames(sub1) <- sub[,1 ]
  sub1 <- sub1[-1, ]
  sub1$var1 <- as.numeric(as.character(sub1$var1)) 
  sub1$var2 <- as.numeric(as.character(sub1$var2))
  sub1$var3 <- as.numeric(as.character(sub1$var3))
  sub1 <- sub1[,colSums(is.na(sub1))<nrow(sub1)]
  datasets[[j]] <- sub1
  j <- j+1

}

# apply linear regression to each dataset (llply and ldply come from the plyr package)
library(plyr)
regressions <- llply(datasets, lm, formula = var1 ~ .)

# extract coefficients from regressions
coefs <- ldply(regressions, coef)

This works without problems:

> coefs
   (Intercept)       var2       var3
1 0.0009635977  0.1627555 -0.1738419
2 0.2571188803 -0.3548750 -0.0248167
3 0.1109881052 -0.0722544  0.1439666

2. Example with unbalanced data set

Now I add missing values to the data set, so that one variable is entirely missing for some countries:

# Make var2 of CountryA and var3 of CountryB entirely missing
Q[2,3:40] <- NA
Q[6,3:40] <- NA

If I run the code from step 1 again, everything works except the last statement, coefs <- ldply(regressions, coef), which fails with an error:

[...]
> coefs <- ldply(regressions, coef)
Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) : 
  Results do not have equal lengths

My question: How can I modify the code so that it works on an unbalanced data set (one where some variables are missing)?

Thanks for any help or suggestions!

Solution

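Replace each column that is entirely NA with zeros. lm then keeps the variable as a term but cannot estimate it, so its coefficient is reported as NA and every country returns a coefficient vector of the same length: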

Coef <- function(x) {
    DF <- setNames(as.data.frame(t(x[-(1:2)])), x$Variable)
    DF[colSums(is.na(DF)) == nrow(DF)] <- 0
    coef(lm(var1 ~., DF))
}
do.call(rbind, by(Q, Q$Country, Coef))

giving:

         (Intercept)        var2       var3
CountryA  0.01863015          NA -0.1982462
CountryB  0.26296826 -0.35416216         NA
CountryC  0.11098809 -0.07225439  0.1439667
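If you would rather not put artificial zeros into the data, a possible alternative is to drop the all-NA columns (as the question's loop already does) and then align each coefficient vector to a fixed set of names before binding. The following is only a sketch under assumptions: the names all_terms and coef_aligned are made up for this illustration, and it assumes var1 is observed for every country. It rebuilds the example data so that it runs on its own.

```r
# Rebuild the example data from the question, including the two all-NA rows.
set.seed(1)
Q <- as.data.frame(matrix(rnorm(360), 9, 40))
colnames(Q) <- c("Country", "Variable", 1900:1937)
Q$Country  <- rep(c("CountryA", "CountryB", "CountryC"), each = 3)
Q$Variable <- rep(paste0("var", 1:3), times = 3)
Q[2, 3:40] <- NA   # var2 of CountryA is entirely missing
Q[6, 3:40] <- NA   # var3 of CountryB is entirely missing

# Hypothetical helper names; the full set of coefficients we expect per country.
all_terms <- c("(Intercept)", "var2", "var3")

coef_aligned <- function(x) {
  # Transpose so that years become rows and var1..var3 become columns.
  DF <- setNames(as.data.frame(t(x[-(1:2)])), x$Variable)
  # Drop predictors that are entirely NA (assumes var1 itself is present).
  DF <- DF[, colSums(is.na(DF)) < nrow(DF), drop = FALSE]
  fit <- lm(var1 ~ ., data = DF)
  # Indexing by name yields NA for terms that were not in the model,
  # so every country returns a vector of the same length.
  setNames(coef(fit)[all_terms], all_terms)
}

res <- do.call(rbind, lapply(split(Q, Q$Country), coef_aligned))
print(res)
```

The NA entries should fall in the same cells as in the table above (var2 for CountryA, var3 for CountryB). Because each vector now has the same length and names, the result can be bound with rbind or passed to ldply without the "Results do not have equal lengths" error.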

Licensed under: CC-BY-SA with attribution
