Imputation in R

https://stackoverflow.com/questions/13114812

r
imputation

15-07-2021

Question

I am new to the R programming language. Is there a way to impute the missing (NA) values of just one column in a dataset? All the imputation functions and packages I have seen impute the missing values of the whole dataset.

Solution

Here is an example using the impute function from the Hmisc package:

library(Hmisc)
DF <- data.frame(age = c(10, 20, NA, 40), sex = c('male','female'))

# impute with mean value

DF$imputed_age <- with(DF, impute(age, mean))

# impute with random value
DF$imputed_age2 <- with(DF, impute(age, 'random'))

# impute with the median
with(DF, impute(age, median))
# impute with the minimum
with(DF, impute(age, min))

# impute with the maximum
with(DF, impute(age, max))


# and if you are sufficiently foolish
# impute with number 7 
with(DF, impute(age, 7))

# impute with letter 'a'
with(DF, impute(age, 'a'))

See ?impute for details on how the imputation is implemented.
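
For the single-column case in the question, the mean imputation above can also be sketched in base R, without any package. This is only a minimal illustration and not how Hmisc implements impute; the column name imputed_age_base is made up for the example.

```r
# Same toy data as above (sex written out in full instead of recycled)
DF <- data.frame(age = c(10, 20, NA, 40),
                 sex = c('male', 'female', 'male', 'female'))

# Copy the column, then overwrite only its NA entries
# with the mean of the observed values
DF$imputed_age_base <- DF$age
missing_rows <- is.na(DF$age)
DF$imputed_age_base[missing_rows] <- mean(DF$age, na.rm = TRUE)

DF$imputed_age_base
```

Only the age column is touched; swapping mean for median gives median imputation in the same way.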

Other tips

Why not use more sophisticated imputation algorithms, such as mice (Multiple Imputation by Chained Equations)? Below is a code snippet in R you can adapt to your case.

library(mice)

#get the nhanes dataset
dat <- mice::nhanes

#impute it with mice
imp <- mice(dat, m = 3, printFlag = FALSE)

imputed_dataset_1<-complete(imp,1)

head(imputed_dataset_1)

#     age  bmi hyp chl
# 1   1   22.5   1 118
# 2   2   22.7   1 187
# 3   1   30.1   1 187
# 4   3   24.9   1 186
# 5   1   20.4   1 113
# 6   3   20.4   1 184

#Now, let's see what methods have been used to impute each column
meth<-imp$method
#  age   bmi   hyp   chl
#"" "pmm" "pmm" "pmm"

#The age column is complete, so it won't be imputed
#Columns bmi, hyp and chl are going to be imputed with pmm (predictive mean matching)

#Let's say that we want to impute only the "hyp" column
#So, we set the methods for the bmi and chl columns to ""
meth[c("bmi", "chl")] <- ""
#age   bmi   hyp   chl 
#""    "" "pmm"    "" 

#Let's run the mice imputation again, this time passing our modified vector as the method argument
imp <- mice(mice::nhanes, m = 3, printFlag = FALSE, method = meth)

partly_imputed_dataset_1 <- complete(imp, 3)

head(partly_imputed_dataset_1)

#    age  bmi hyp chl
# 1   1   NA   1  NA
# 2   2 22.7   1 187
# 3   1   NA   1 187
# 4   3   NA   2  NA
# 5   1 20.4   1 113
# 6   3   NA   2 184
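
Another way to end up with only one column filled in is to run the default imputation on all columns and then copy just the column of interest back into the original data. The sketch below assumes the mice package is installed; the name dat_hyp_only and the seed value are arbitrary choices for the illustration.

```r
library(mice)

dat <- mice::nhanes

# Default run: every incomplete column is imputed (seed set for reproducibility)
imp <- mice(dat, m = 3, printFlag = FALSE, seed = 123)

# Start from the original data and replace only hyp
# with the values from the first completed dataset
dat_hyp_only <- dat
dat_hyp_only$hyp <- complete(imp, 1)$hyp

# Count the remaining NAs per column
colSums(is.na(dat_hyp_only))
```

Here bmi and chl keep their original missing values, while hyp is imputed using the other columns as predictors.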

There are plenty of packages that can do this for you. (A little more information about your data would help in suggesting the best option.)

One example can be using the VIM package.

It has a function called kNN (k-nearest-neighbor imputation). This function has an argument, variable, where you can specify which variables should be imputed.

Here is an example:

library("VIM")
kNN(sleep, variable = c("NonD","Gest"))

The sleep dataset I used in this example comes along with VIM.

If the column you want to impute has some time dependency, a time series imputation package could also make sense. In that case you could use, for example, the imputeTS package. Here is an example:

library(imputeTS)
# na_kalman() is the current name; older imputeTS versions called it na.kalman()
na_kalman(tsAirgap)

The tsAirgap dataset used in this example also ships with imputeTS.
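
To make the idea of using neighbouring time points concrete, here is a minimal base R sketch that fills gaps by linear interpolation with approx(). This is a much simpler stand-in, not the Kalman smoothing that imputeTS performs, and the series x is made up for the illustration.

```r
# Hypothetical short series with two gaps
x <- c(1, 2, NA, 4, NA, 6)

idx <- seq_along(x)
observed <- !is.na(x)

# approx() interpolates linearly between the observed points
# and evaluates the result at every time index
filled <- approx(x = idx[observed], y = x[observed], xout = idx)$y

filled
```

Each missing value is replaced by the point on the straight line between its nearest observed neighbours, so this only works for gaps that have observed values on both sides.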

License: CC-BY-SA with attribution

Not affiliated with StackOverflow