Removing Rows in R Has Distorted the Data Set

https://stackoverflow.com/questions/17147240

31-05-2022

Question

I removed certain rows from my data frame using the following code:

df2 <- df1[!(df1$variable==1), ]

This was a dummy variable, and the rows that had the value 1 for it were successfully removed. (I checked the dimensions of the data frame with dim() before and after, and everything seemed normal.)

However, when I reran my regression model on the new data set df2, the degrees of freedom had fallen sharply, by far more than the number of rows I had removed.

I wondered how this could happen. Then I realized that the new data set contained many rows consisting only of NAs. For every row where the dummy variable had a missing value, R had produced a full row of NA values.

After realizing that the above code was not the best way to delete rows, I tried the following:

df2 <- df1[(df1$variable==0 | is.na(df1$variable)), ]

It seems to have worked, since I no longer have the same problem. But could this new code have similar or other problems that I am not aware of yet?

Solution

The new code should be fine. The problem with the old code was caused by a combination of the NAs in df1$variable and the == comparison operator.

If you read the help on comparison operators, ?"==", you will see, "Missing values (NA) and NaN values are regarded as non-comparable even to themselves, so comparisons involving them will always result in NA."
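A minimal sketch of that rule at the console, using only base R. The vector x is made up purely for illustration and is not part of the original data:

```r
# Hypothetical vector, for illustration only
x <- c(0, 1, NA)

cmp  <- x == 1               # comparing NA with anything gives NA
neg  <- !cmp                 # negating NA is still NA
keep <- x == 0 | is.na(x)    # NA | TRUE is TRUE, so the NA is resolved

print(cmp)   # FALSE  TRUE    NA
print(neg)   #  TRUE FALSE    NA
print(keep)  #  TRUE FALSE  TRUE
```

The third line is why the second version of the subsetting code behaves well: is.na() supplies a definite TRUE wherever the comparison alone would leave an NA.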

In your case, whenever df1$variable was NA, the result of the comparison was NA (not TRUE or FALSE). Indexing the rows of a data frame with an NA logical value returns a row in which every column is NA. For example:

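df1 <- expand.grid(variable=c(0, 1, NA), var2=c(0, 1, NA))

# Old approach: the selector is NA wherever variable is NA
sel1 <- !(df1$variable==1)
sel1
df1[sel1, ]  # each NA in the selector yields a row filled entirely with NA

# New approach: is.na() turns those NAs into TRUE, so the real rows are kept
sel2 <- df1$variable==0 | is.na(df1$variable)
sel2
df1[sel2, ]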
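Note that the new code keeps the rows where variable is NA. Whether that is what you want depends on your analysis. As a rough sketch of some base R alternatives, reusing the toy data from the example above (the result names keep_na, drop_na_which and drop_na_subset are just illustrative):

```r
# Same toy data as in the example above
df1 <- expand.grid(variable = c(0, 1, NA), var2 = c(0, 1, NA))

# Keep rows where variable is not 1, including rows where it is NA.
# %in% never returns NA, so no all-NA rows are created.
keep_na <- df1[!(df1$variable %in% 1), ]

# Keep only rows where variable is 0; which() drops the NA positions
drop_na_which <- df1[which(df1$variable == 0), ]

# subset() treats an NA condition as FALSE, so it also drops those rows
drop_na_subset <- subset(df1, variable == 0)

print(nrow(keep_na))         # 6
print(nrow(drop_na_which))   # 3
print(nrow(drop_na_subset))  # 3
```

The first form gives the same rows as the new code in the question. The other two are an option if rows with a missing dummy value should be excluded as well.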

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow