Question

I have a dataframe that looks like this:

set.seed(300)
df <- data.frame(site = sort(rep(paste0("site", 1:5), 5)), 
                 value = sample(c(1:5, NA), replace = T, 25))

df 

    site value
1  site1    NA
2  site1     5
3  site1     5
4  site1     5
5  site1     5
6  site2     1
7  site2     5
8  site2     3
9  site2     3
10 site2    NA
11 site3    NA
12 site3     2
13 site3     5
14 site3     4
15 site3     4
16 site4    NA
17 site4    NA
18 site4     4
19 site4     4
20 site4     4
21 site5    NA
22 site5     3
23 site5     3
24 site5     1
25 site5     1    

As you can see, there are several missing values in the value column. I need to replace each missing value with the mean for its site. So if there is a missing value for value measured at site1, I need to impute the mean value for site1.

However, the dataframe is constantly being added to and re-imported into R. The next time I import it, it will likely have grown to something like 50 rows, with many more missing values in value. I need a function that automatically detects which site each missing value was measured at and imputes the mean for that particular site. Could anybody help me with this?

Solution

Using impute() from the Hmisc package and ddply() from the plyr package:

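library(plyr)
library(Hmisc)

# For each site, fill NAs in value with that site's mean;
# the result is stored in a new column, imputed.value.
df2 <- ddply(df, "site", mutate, imputed.value = impute(value, mean))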

Other suggestions

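First, get the distinct site names. (levels() only works if site is a factor; since R 4.0.0, data.frame() keeps strings as character vectors by default, so unique() is the safer choice.)

sites <- unique(as.character(df$site))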

Then compute the mean for each site, ignoring missing values:

nlevels <- length(sites)
meanlist <- numeric(nlevels)
for (i in seq_len(nlevels))
    meanlist[i] <- mean(df[df[, 1] == sites[i], 2], na.rm = TRUE)

Then fill in each of the NA values. There's probably a faster way, but as long as your data set isn't huge, you can do it with for loops.

for (i in seq_len(nrow(df)))
    if (is.na(df[i, 2]))
        df[i, 2] <- meanlist[which(sites == df[i, 1])]

Hope this helps.
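
The loop-based approach above can also be written without explicit loops. The following is a minimal sketch, not the answerer's code: it uses a small toy data frame (an assumption made here so the result is easy to check by hand) and looks up each missing row's site mean by name from the vector returned by tapply().

```r
# Toy data (hypothetical, chosen so the means are easy to verify by hand).
df <- data.frame(site  = rep(c("site1", "site2"), each = 3),
                 value = c(NA, 5, 3, 1, NA, 3))

# Per-site means, ignoring NAs; the result is named by site.
site_means <- tapply(df$value, df$site, mean, na.rm = TRUE)

# Rows whose value is missing.
missing <- is.na(df$value)

# Look up each missing row's site mean by name and fill it in.
df$value[missing] <- site_means[as.character(df$site[missing])]

df$value
# 4 5 3 1 2 3
```

Because the lookup is by site name, this should work whether site is stored as a character vector or as a factor.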

A solution in one (admittedly long) line, with no for loop:

set.seed(300)
df <- data.frame(site = sort(rep(paste0("site", 1:5), 5)), 
                 value = sample(c(1:5, NA), replace = T, 25))


# ave() returns each row's site mean; keep only the entries for the NA rows.
df$value[is.na(df$value)] <- ave(df$value, df$site,
                                 FUN = function(x) mean(x, na.rm = TRUE))[is.na(df$value)]

As a function:

fillITin <- function(x) {
  # Replace each NA in x$value with the mean of its site.
  x$value[is.na(x$value)] <- ave(x$value, x$site,
                                 FUN = function(z) mean(z, na.rm = TRUE))[is.na(x$value)]
  return(x)
}


fillITin(df)
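
Since the question mentions that the data frame keeps growing, it may help to see a variant that does not hard-code the column names and that shows what happens when a site has no observed values at all. This is only a sketch under assumptions: the function name impute_group_mean and its group_col and value_col arguments are hypothetical and do not appear in the answers above.

```r
# Hypothetical helper: replace NAs in `value_col` with the mean of the
# matching group in `group_col`. Column names are passed as strings.
impute_group_mean <- function(data, group_col = "site", value_col = "value") {
  values <- data[[value_col]]
  # ave() returns, for every row, the mean of that row's group (NAs ignored).
  group_means <- ave(values, data[[group_col]],
                     FUN = function(z) mean(z, na.rm = TRUE))
  missing <- is.na(values)
  values[missing] <- group_means[missing]
  data[[value_col]] <- values
  data
}

# Toy data (hypothetical): site "c" has no observed values.
toy <- data.frame(site  = c("a", "a", "b", "b", "c"),
                  value = c(1, NA, NA, 4, NA))

result <- impute_group_mean(toy)
result$value
# 1 1 4 4 NaN
```

A site with only missing values has no mean to impute, so its rows come back as NaN rather than a number. Whether to leave those as they are, drop them, or fall back to an overall mean is a decision that depends on the data.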

Licensed under: CC-BY-SA with attribution