Filtering data in R

https://stackoverflow.com/questions/476726

filtering
r

20-08-2019
|

Question

I have a CSV of file of data that I can load in R using read.csv()

Some of the data is missing, so I want to reduce the data frame down to set that consists entirely of non-missing data, i.e. if a NULL appears anywhere, I want to exclude that column and row from the filtered data set.

I know I can probably do this fairly simply with the inbuilt R vector operations, but I am not quite sure how to do this exactly?

To make my question a little more concrete, here is a quick sample of the data so you can see what I want to do.

DocID       Anno1    Anno7  Anno8
1           7        NULL   8
2           8        NULL   3
44          10       2      3
45          6        6      6
46          1        3      4
49          3        8      5
62          4        NULL   9
63          2        NULL   4
67          11       NULL   3
91          NULL     9      7
92          NULL     7      5
93          NULL     8      8

So given this input, I need some code that will reduce the output to this.

DocID       Anno8
44          3
45          6
46          4
49          5

As Anno8 is the only column with non-NULL data, and there are only four rows with non-NULL data.

Solution

If x is your data.frame (or matrix) then

x[ ,apply(x, 2, function(z) !any(is.na(z)))]

Since your example uses NULL, is.na(·) will be replaced by is.null(·)

Alternatively you can look at subset(·).

OTHER TIPS

You can drop any row containing a missing using na.omit(), however that's not what you want. Moreover, the currently accepted answer is wrong. It gives you complete columns, but does not drop the rows that have one or more missing values, which is what was asked for. The correct answer can be obtained as:

> a <- data.frame(a=c(1,2),b=c(NA,1), c=c(3,4))
> a
  a  b c
1 1 NA 3
2 2  1 4
> na.omit(a)[,colSums(is.na(a))==0]
  a c
2 2 4

To see that the above answer is wrong:

> a[ ,apply(a, 2, function(z) !any(is.na(z)))]
  a c
1 1 3
2 2 4

Row 1 should be dropped because of the NA in column 2.

a <- data.frame(a=c(1,2,0,1),b=c(NA,1,NA,1), c=c(3,4,5,1))

na.omit(a)
  a b c
2 2 1 4
4 1 1 1

a[rowSums(is.na(a))==0,]
  a b c
2 2 1 4
4 1 1 1

a[complete.cases(a),]
  a b c
2 2 1 4
4 1 1 1

Also, you can do it using sqldf library, if x is your dataframe:

library(sqldf)
result <- sqldf("SELECT DocID, Anno8 FROM x
                 WHERE Anno1 IS NOT NULL AND Anno7 IS NOT NULL")

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow