The strategy is to split the data frame by columns into variable groups and, for each group, flag the rows that have at least one non-NA value. We then use rowSums to check that a row has non-NA values in at least two groups and, if so, add the mean of that row's non-NA values with cbind.
This generalizes to any number of columns, so long as they are named in the AA_varXXX format and the only column not in that format is myid. That is easy enough to fix if it isn't strictly the case, but those are the limitations of the code as written.
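# Drop the id column so only the variable columns remain
df.dat <- df[names(df) != "myid"]
# TRUE for rows with non-NA values in more than one variable group (AA, BB, ...)
diverse.rows <- rowSums(
  sapply(
    split.default(df.dat, gsub("^([A-Z]{2})_var.*", "\\1", names(df.dat))),
    function(x) apply(x, 1, function(y) any(!is.na(y)))
  )
) > 1
# Row mean of the non-NA values, or NA where fewer than two groups have data
cbind(df, div.mean = ifelse(diverse.rows, rowMeans(df.dat, na.rm = TRUE), NA))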
Produces:
  AA_var1 AA_var2   myid BB_var3 BB_var4 div.mean
1      NA      NA 123456      10      12       NA
2      NA      10 194200      12      NA       11
3      12      10 132200      NA      NA       NA
4      12      NA 132201      NA      12       NA
5      NA      NA 132202      NA      NA       NA
6      12      13 132203      14      NA       13
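
For anyone who wants to run this end to end, here is a minimal sketch. The input df is reconstructed from the output table above, which is an assumption since the original data is not shown here. The wrapper add_div_mean and its id_cols and pattern arguments are hypothetical names, added only to illustrate one way of relaxing the naming limitations mentioned earlier; it also swaps the inner apply for rowSums(!is.na(x)) > 0, which should give the same flags.

```r
# Input reconstructed from the output table above (an assumption).
df <- data.frame(
  AA_var1 = c(NA, NA, 12, 12, NA, 12),
  AA_var2 = c(NA, 10, 10, NA, NA, 13),
  myid    = c(123456, 194200, 132200, 132201, 132202, 132203),
  BB_var3 = c(10, 12, NA, NA, NA, 14),
  BB_var4 = c(12, NA, NA, 12, NA, NA)
)

# Hypothetical helper: `id_cols` and `pattern` are illustrative parameters,
# not part of the original answer.
add_div_mean <- function(df, id_cols = "myid",
                         pattern = "^([A-Z]{2})_var.*") {
  # Keep only the variable columns.
  dat <- df[!names(df) %in% id_cols]
  # Group label for each column, e.g. "AA" for "AA_var1".
  groups <- sub(pattern, "\\1", names(dat))
  # One logical column per group: does the row have any non-NA value there?
  has.value <- sapply(
    split.default(dat, groups),
    function(x) rowSums(!is.na(x)) > 0
  )
  # Rows with data in more than one group.
  diverse.rows <- rowSums(has.value) > 1
  cbind(df, div.mean = ifelse(diverse.rows, rowMeans(dat, na.rm = TRUE), NA))
}

res <- add_div_mean(df)
print(res)
```

Passing several names in id_cols, or a different pattern whose first capture group is the group label, should cover data that does not follow the AA_varXXX convention exactly. Like the original, this assumes the data frame has more than one row, since sapply returns a plain vector otherwise.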