Function to impute missing value [duplicate]

https://stackoverflow.com/questions/20273070

06-08-2022

Question

I have a dataframe that looks like this:

set.seed(300)
df <- data.frame(site = sort(rep(paste0("site", 1:5), 5)), 
                 value = sample(c(1:5, NA), size = 25, replace = TRUE))

df 

    site value
1  site1    NA
2  site1     5
3  site1     5
4  site1     5
5  site1     5
6  site2     1
7  site2     5
8  site2     3
9  site2     3
10 site2    NA
11 site3    NA
12 site3     2
13 site3     5
14 site3     4
15 site3     4
16 site4    NA
17 site4    NA
18 site4     4
19 site4     4
20 site4     4
21 site5    NA
22 site5     3
23 site5     3
24 site5     1
25 site5     1

As you can see, there are several missing values in the value column. I need to replace each missing value with the mean for its site. So if value is missing for a row measured at site1, I need to impute the mean of value for site1. However, the dataframe is constantly being added to and re-imported into R; the next time I import it, it will likely have grown to something like 50 rows, with many more missing values in value. I need a function that automatically detects which site each missing value was measured at and imputes the mean for that particular site. Could anybody help me with this?

Solution

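Using impute() from the Hmisc package and ddply() from the plyr package:

library(plyr)
library(Hmisc)

# For each site, fill the NAs in value with that site's mean.
# The result goes in a new column; the original value column is kept.
df2 <- ddply(df, "site", mutate, imputed.value = impute(value, mean))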
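To sanity-check the imputed column, the per-site means it should contain can be computed independently in base R. This is only a minimal sketch on a small hypothetical data frame (toy is not the df from the question); the formula interface of aggregate() drops rows with NA by default, so each mean is taken over the observed values only.

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(site  = c("a", "a", "a", "b", "b"),
                  value = c(1, NA, 3, NA, 10))

# One row per site with the mean of the non-missing values
site_means <- aggregate(value ~ site, data = toy, FUN = mean)
site_means
```

Every imputed entry for a given site should equal the value shown for that site in this table.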

Other suggestions

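First, you can get the distinct sites. (levels(df$site) only works when site is a factor; since R 4.0, data.frame() keeps strings as character by default, so unique() is the safer choice.)

sites <- unique(as.character(df$site))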

You can then get the mean for each site, ignoring missing values:

nlevels <- length(sites)
meanlist <- numeric(nlevels)
for (i in seq_len(nlevels))
    meanlist[i] <- mean(df$value[df$site == sites[i]], na.rm = TRUE)

Then you can fill in each of the NA values. There's probably a faster way, but as long as your set isn't huge, you can do it with for loops.

for (i in seq_len(nrow(df))) {
    if (is.na(df[i, 2])) {
        # look up the mean for this row's site
        df[i, 2] <- meanlist[which(sites == df[i, 1])]
    }
}

Hope this helps.
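The "faster way" hinted at above could look something like the following: compute all site means in one call with tapply(), then fill the missing rows by looking up each row's site by name. This is a sketch on hypothetical toy data rather than the question's df, and the variable names are illustrative.

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(site  = c("a", "a", "a", "b", "b"),
                  value = c(1, NA, 3, NA, 10))

# Named vector of means, one per site, ignoring NAs
site_means <- tapply(toy$value, toy$site, mean, na.rm = TRUE)

# Fill each missing row with the mean of its own site (lookup by name)
is_missing <- is.na(toy$value)
toy$value[is_missing] <- site_means[as.character(toy$site[is_missing])]

toy$value
```

The as.character() call makes the lookup go by site name even if site is stored as a factor.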

A solution in one (admittedly long) line, with no for loop:

set.seed(300)
df <- data.frame(site = sort(rep(paste0("site", 1:5), 5)), 
                 value = sample(c(1:5, NA), size = 25, replace = TRUE))

# ave() returns the site mean for every row; keep only the rows that were NA
df$value[is.na(df$value)] <- ave(df$value, df$site,
                                 FUN = function(x) mean(x, na.rm = TRUE))[is.na(df$value)]

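As a function:

fillITin <- function(x) {
  # replace each NA in value with the mean of its site
  x$value[is.na(x$value)] <- ave(x$value, x$site,
                                 FUN = function(z) mean(z, na.rm = TRUE))[is.na(x$value)]
  return(x)
}

# fillITin() returns a modified copy, so assign the result to keep it
df <- fillITin(df)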
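Since the data frame is re-imported as it grows, it may help to avoid hard-coding the column names and to be aware of one edge case: a site with no observed values at all has no mean to impute. The sketch below is one possible generalisation under those assumptions; impute_group_mean and its arguments value_col and group_col are hypothetical names, not part of any package.

```r
# Hypothetical helper: fill NAs in one column with the mean of their group
impute_group_mean <- function(data, value_col = "value", group_col = "site") {
  values <- data[[value_col]]

  # Group mean repeated for every row, computed from non-missing values
  group_means <- ave(values, data[[group_col]],
                     FUN = function(z) mean(z, na.rm = TRUE))

  is_missing <- is.na(values)
  values[is_missing] <- group_means[is_missing]

  # A group with only NAs has mean NaN, so its rows stay missing (as NaN)
  data[[value_col]] <- values
  data
}

# Hypothetical toy data: site "c" has no observed value
toy <- data.frame(site  = c("a", "a", "b", "b", "c"),
                  value = c(2, NA, NA, 6, NA))
filled <- impute_group_mean(toy)
filled
```

Rows from a site without any observed value come back as NaN, which is.na() still flags, so they can be handled separately, for example with an overall mean.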

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow