I want to replace missing values in the columns of a data frame. I have written the following code:

MedianImpute <- function(data)
     {
      # Loop over columns by index; seq_len() is safe when ncol(data) is 0
      for(i in seq_len(ncol(data)))
        {
        # Only impute numeric (double or integer) columns
        if(is.numeric(data[,i]))
          {
          if(anyNA(data[,i]))
            {
            data[is.na(data[,i]),i] <-
                          median(data[,i],na.rm = TRUE)
            }
          }
        }
      return(data)
      }

This returns the data frame with the NAs replaced by the column median. I do not want to use a for loop; how can I get the same result using any of the apply functions in R?


Solution 2

This is actually a subtle problem, so worth a bit of discussion (IMO). You have a data frame and want to impute medians for numeric columns only, with the result being, of course, a data frame.

The apply(...) function will coerce its argument to a matrix first. Since all elements in a matrix must by definition be the same data type, if there are any character or factor columns in the original df, the whole matrix will be coerced to character when it is passed to apply(...).

# 1st column of df is a factor (stringsAsFactors=TRUE is needed in R >= 4.0.0)
df <- data.frame(a=letters[1:5],x=sample(1:5,5),y=runif(5),stringsAsFactors=TRUE)
df$x[3] <- NA
df$y[5] <- NA
df
#   a  x         y
# 1 a  5 0.5235779
# 2 b  3 0.2142011
# 3 c NA 0.8886608
# 4 d  4 0.4952574
# 5 e  1        NA

apply(df,2,function(x) {
  if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x})
#      a   x    y          
# [1,] "a" " 5" "0.5235779"
# [2,] "b" " 3" "0.2142011"
# [3,] "c" NA   "0.8886608"
# [4,] "d" " 4" "0.4952574"
# [5,] "e" " 1" NA         

sapply(df,FUN=f) will pass the columns of df individually to a function f(...), but the result will be a matrix. So, for example, any factors in df will be coerced to integer.

sapply(df,function(x) {
  if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x})
#      a   x         y
# [1,] 1 5.0 0.5235779
# [2,] 2 3.0 0.2142011
# [3,] 3 3.5 0.8886608
# [4,] 4 4.0 0.4952574
# [5,] 5 1.0 0.5094176

So here, df$x and df$y are correct, but look what happened to df$a: the factor was coerced to numeric by returning its underlying integer codes rather than its labels - not what you want!

lapply(df,FUN=f) will return a list, which can then be converted to a data frame. This approach gives you the desired result (the output below appears to come from a different random draw of df, so its values differ from those above):

data.frame(lapply(df,function(x) {
    if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x}))
#   a   x         y
# 1 a 1.0 0.3093707
# 2 b 3.0 0.3486391
# 3 c 3.5 0.8292446
# 4 d 5.0 0.7882574
# 5 e 4.0 0.5684483

I suppose it's debatable whether this is any better than using a loop...
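
As a variation on the lapply(...) approach above, here is a minimal sketch (not part of the original answer) that assigns into df[] so the object stays a data frame without a separate data.frame(...) call. The helper name impute_median and the example data are made up for illustration.

```r
# Hypothetical helper (not from the original answer):
# median-impute numeric columns only, leave other columns untouched.
impute_median <- function(df) {
  # df[] <- ... replaces the columns but keeps df a data frame
  df[] <- lapply(df, function(x) {
    if (is.numeric(x)) {
      x[is.na(x)] <- median(x, na.rm = TRUE)
    }
    x
  })
  df
}

# Made-up example data: one factor column, two numeric columns with one NA each
example_df <- data.frame(a = factor(c("u", "v", "w", "x")),
                         x = c(1, NA, 3, 10),
                         y = c(NA, 0.2, 0.4, 0.6))
result <- impute_median(example_df)
```

Indexed assignment is used here instead of ifelse(...); it only touches the NA positions, and the factor column passes through unchanged.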

Other tips

You could use apply to apply a function across all columns, provided every column is numeric (note that apply returns a matrix):

dat <- data.frame(c1=c(1,2,3,NA), c2=c(10, NA, 20, 30))
apply(dat, 2, function(x) ifelse(is.na(x), median(x, na.rm=TRUE), x))

A slightly faster version assigns into the NA positions directly instead of calling ifelse:

imputeMedianv3 <- function(x) apply(x, 2, function(col) {col[is.na(col)] <- median(col, na.rm=TRUE); col})
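
Both apply-based versions return a matrix rather than a data frame, and they only behave as intended when every column is numeric (see the coercion discussion in the other answer). A minimal sketch of checking this and converting back with as.data.frame(...); the variable names m and imputed are illustrative only:

```r
dat <- data.frame(c1 = c(1, 2, 3, NA), c2 = c(10, NA, 20, 30))

# Same idea as imputeMedianv3 above, written out over several lines
imputeMedianv3 <- function(x) {
  apply(x, 2, function(col) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  })
}

m <- imputeMedianv3(dat)      # numeric matrix with columns c1 and c2
imputed <- as.data.frame(m)   # convert back if a data frame is needed
```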

I'm sure that if what you're looking for is performance, someone will provide a data.table solution (unfortunately I am not familiar with that package, so I can't provide one myself).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow