R: “apply” statement to take the sum of the number of non-NA values across multiple columns

https://stackoverflow.com/questions/10486110

06-06-2021

Question

I have a large dataframe of doctor visit records. Each record (row) can have up to 11 diagnosis codes. I want to know how many non-NA diagnosis codes are in each row.

Here is a sample of the data:

diag1 diag2 diag3 diag4 diag5 diag6 diag7 diag8 diag9 diag10 diag11
  786   272   401   782   250 91912   530    NA    NA     NA     NA
  845   530   338   311    NA    NA    NA    NA    NA     NA     NA

So for these two rows, I want to find that row 1 has 7 codes and row 2 has 4 codes. The data frame has 31,596 rows, so a loop takes far too long. I'd like to use an "apply" statement to speed things up:

z = apply(y[,paste("diag", 1:11, sep="")], 1, function(x)sum({any(x[!is.na(x)])}))

R just returns a vector of 1's that is the same length as the number of rows in the dataset. I think something is wrong with using "any"? Does anyone have a good way to count the number of non-NA values across multiple columns? Thanks!

Solution

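Just use is.na and rowSums. Negating is.na gives a logical matrix that is TRUE wherever a code is present, and rowSums counts the TRUE values in each row: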

z <- rowSums(!is.na(y[,paste("diag", 1:11, sep="")]))
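To see the difference on the sample data, here is a minimal sketch that rebuilds the two rows from the question. The data frame is named `y` to match the question; the way it is constructed, and the helper name `diag_cols`, are only for illustration. It also shows why the original attempt returned all 1's: `any()` collapses each row to a single TRUE, and `sum(TRUE)` is 1.

```r
# Rebuild the two sample rows from the question (illustrative data only).
vals <- c(786, 272, 401, 782, 250, 91912, 530, NA, NA, NA, NA,
          845, 530, 338, 311, NA,  NA,    NA,  NA, NA, NA, NA)
y <- as.data.frame(matrix(vals, nrow = 2, byrow = TRUE))
names(y) <- paste("diag", 1:11, sep = "")

# Hypothetical helper variable holding the diagnosis column names.
diag_cols <- paste("diag", 1:11, sep = "")

# !is.na() gives a logical matrix; rowSums() counts the TRUEs in each row.
z <- rowSums(!is.na(y[, diag_cols]))
print(z)

# The original attempt: any() coerces the non-NA numbers to logical
# (with a warning) and returns a single TRUE per row, so sum() gives 1.
z_any <- suppressWarnings(
  apply(y[, diag_cols], 1, function(x) sum(any(x[!is.na(x)])))
)
print(z_any)
```

Here `z` holds the counts 7 and 4, while `z_any` is 1 for both rows.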

OTHER TIPS

You could also use:

apply(y, 1, function(x) length(na.omit(x)))

but the rowSums solution above (Joshua Ulrich's answer) is much faster.
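If you want to check the speed difference yourself, a rough sketch along these lines should work. The data here is randomly generated (the seed, the value range, and the variable names are all assumptions for illustration); only the row count comes from the question. Timings depend on your machine, so none are quoted.

```r
set.seed(1)   # arbitrary seed, only so the fake data is reproducible
n <- 31596    # row count mentioned in the question

# Fake diagnosis codes with some NAs mixed in (illustrative only).
m <- matrix(sample(c(NA, 100:999), n * 11, replace = TRUE), nrow = n)
y <- as.data.frame(m)
names(y) <- paste("diag", 1:11, sep = "")

# Time the vectorised approach and the row-by-row apply() approach.
t_rowsums <- system.time(z1 <- rowSums(!is.na(y)))
t_apply   <- system.time(z2 <- apply(y, 1, function(x) length(na.omit(x))))

# Both approaches should agree on the counts.
stopifnot(all(z1 == z2))

# Elapsed seconds for each approach.
print(c(rowSums = t_rowsums[["elapsed"]], apply = t_apply[["elapsed"]]))
```

Note that `apply(y, 1, ...)` counts non-NA values across every column of `y`. If your data frame has columns other than the diagnosis codes, subset it first, as the rowSums solution does.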

Licensed under: CC-BY-SA with attribution
