Question

I am trying to find out which columns of a data.table object have a number of non-NA values that reaches a certain threshold (for example, 5), and then discard the columns that do not meet this criterion.

Consider the following data:

require(data.table)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,NA,6), v=c(1,2,NA,NA,NA,NA,NA,8,9))
DT
   x  y  v
1: a  1  1
2: a NA  2
3: a  6 NA
4: b  1 NA
5: b NA NA
6: b  6 NA
7: c  1 NA
8: c NA  8
9: c  6  9

In the above example, column v has only 4 non-NA values, which is fewer than 5, so I'd like to discard that column:

DT[,c(3) := NULL]
DT
   x  y
1: a  1
2: a NA
3: a  6
4: b  1
5: b NA
6: b  6
7: c  1
8: c NA
9: c  6

I need help understanding how to combine the .N* symbol and if statements with data.table to check an object with many columns.

My question is: how can I do the count programmatically across all columns and discard only the ones that do not meet the criterion? Thanks.

*I am not sure whether .N is needed, but from previous research I understood that this symbol is used for counting inside data.table objects.

Solution

Here is one way of doing it:

DT[, which(lapply(DT, function(x) sum(!is.na(x))) < 5) := NULL]

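Since a data.table is a list of columns, lapply loops over the individual columns and applies the given function, here a count of the non-NA values in each column. The comparison with 5 flags the columns whose count is below the threshold, which returns the positions of those columns, and := NULL removes them by reference. Note that .N is not needed here: it gives the number of rows, not the number of non-NA values in a column.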
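
The same idea can be written out step by step, which may be easier to adapt. This is only a sketch under a few assumptions: the names min_non_na, non_na, drop_cols and kept are made up for illustration and are not part of data.table, and y is recycled explicitly with rep() to match the table printed in the question. It swaps lapply for vapply so that the counts come back as a plain integer vector, and it guards the removal so that nothing is attempted when every column passes.

```r
library(data.table)

# Same data as in the question; y is recycled explicitly
DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, NA, 6), times = 3),
                 v = c(1, 2, NA, NA, NA, NA, NA, 8, 9))

min_non_na <- 5  # hypothetical name for the threshold

# Count the non-NA values in each column; vapply enforces one integer per column
non_na <- vapply(DT, function(col) sum(!is.na(col)), integer(1))

# Option 1: build a new table that keeps only the columns meeting the threshold
kept <- DT[, .SD, .SDcols = names(non_na)[non_na >= min_non_na]]

# Option 2: drop the failing columns from DT by reference
drop_cols <- names(non_na)[non_na < min_non_na]
if (length(drop_cols) > 0) DT[, (drop_cols) := NULL]

names(DT)
```

Option 1 leaves the original table untouched, while option 2 modifies DT in place, as the one-liner above does. Which one fits better depends on whether the original columns are still needed later.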

Licensed under: CC-BY-SA with attribution