Pergunta

I have a data.frame that has several NAs. I already know that if a column has some zero values for some firm, then those NAs are zeros as well. What would be a good way to replace those and only those NAs with zeros.

An example:

I want to transform this

  FIRMID    VAR1     VAR2        
  FIRM1     0        1
  FIRM1     NA       NA
  FIRM2     1        0
  FIRM2     NA       NA

to this

  FIRMID    VAR1     VAR2        
  FIRM1     0        1
  FIRM1     0        NA
  FIRM2     1        0
  FIRM2     NA       0

EDIT: number of variables is possibly large and therefore I want to find a way to apply this neatly to all of them at the same time without manually entering each variable name.

Foi útil?

Solução

Here is another ddply alternative where you don't have to specify the variable names which the function should be applied on. By using numcolwise, the function operates on all numerical columns.

library(plyr)

myfun <- function(x){
  x[is.na(x) & (sum(!is.na(x) & x == 0) > 0)] <- 0
  x}

ddply(df, .(FIRMID), numcolwise(myfun))

#   FIRMID VAR1 VAR2
# 1  FIRM1    0    1
# 2  FIRM1    0   NA
# 3  FIRM2    1    0
# 4  FIRM2   NA    0

Or in base R, where I assume that the first column contains the grouping variable (dat[ , -1]). You could of course refer to it by name instead.

df2 <- do.call(rbind, by(df, df[ , "FIRMID"], function(dat){
  sapply(dat[ , -1], function(x){
    myfun(x)
  })
}))

data.frame(FIRMID = df$FIRMID, df2)

#   FIRMID VAR1 VAR2
# 1  FIRM1    0    1
# 2  FIRM1    0   NA
# 3  FIRM2    1    0
# 4  FIRM2   NA    0

Update 'myfun' can be written much simpler. Thanks @Arun for the suggestion!

myfun <- function(x){
  x[is.na(x) & any(x == 0)] <- 0
  x}

Outras dicas

If you don't have only integers you might need to adjust this for comparing floating point numbers:

DF <- read.table(text="FIRMID    VAR1     VAR2        
FIRM1     0        1
FIRM1     NA       NA
FIRM2     1        0
FIRM2     NA       NA", header=TRUE)

na_replace <- function(x) {
  if (any(na.omit(x)==0L)) x[is.na(x)] <- 0L
  x
}

library(plyr)
ddply(DF, .(FIRMID), transform, 
      VAR1=na_replace(VAR1),
      VAR2=na_replace(VAR2))

#  FIRMID VAR1 VAR2
#1  FIRM1    0    1
#2  FIRM1    0   NA
#3  FIRM2    1    0
#4  FIRM2   NA    0

you can use ddply here. but it would be very inefficient if the data.frame is very large. if not, then you could try:

your.data.frame<-ddply(your.data.frame,~FIRMID,function(x){
if ( any(x[!is.na(x$VAR1),"VAR1"]==0)){x[is.na(x$VAR1),"VAR1"]<-0}
if ( any(x[!is.na(x$VAR2),"VAR2"]==0)){x[is.na(x$VAR2),"VAR2"]<-0}
x})

but very inelegant

EDIT: my code before didn't worked, so I fixed it :)

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top