r Hmisc::dataframeReduce - replicate actions from one dataset to identically structured dataset

https://stackoverflow.com/questions/16839773

r
hmisc

30-05-2022
|

Question

I'm working on subsets of data from multiple time periods and I'd like to do column and level reduction on my training set and then apply the same actions to other datasets of the same structure.

dataframeReduce in the Hmisc package is what I've been using, but applying the function to different dataset results in slightly different actions.

trainPredictors<-dataframeReduce(trainPredictors, 
                  fracmiss=0.2, maxlevels=20,  minprev=0.075)
testPredictors<-dataframeReduce(testPredictors, 
                  fracmiss=0.2, maxlevels=20,  minprev=0.075)
testPredictors<-testPredictors[,names(trainPredictors)]

The final line ends up erroring because the backPredictors has a column removed that trainPredictors does retains. All other sets should have the transformations applied to trainPredictors applied to them.

Does anyone know how to apply the same cleanup actions to multiple datasets either using dataframeReduce or another function/block of code?

An example

Using the function NAins from http://trinkerrstuff.wordpress.com/2012/05/02/function-to-generate-a-random-data-set/

NAins <-  NAinsert <- function(df, prop = .1){
  n <- nrow(df)
  m <- ncol(df)
  num.to.na <- ceiling(prop*n*m)
  id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
  rows <- id %/% m + 1
  cols <- id %% m + 1
  sapply(seq(num.to.na), function(x){
    df[rows[x], cols[x]] <<- NA
  }
  )
  return(df)
}
library("Hmisc")
trainPredictors<-NAins(mtcars, .1) 
testPredictors<-NAins(mtcars, .3)
trainPredictors<-dataframeReduce(trainPredictors, 
                                 fracmiss=0.2, maxlevels=20,  minprev=0.075)
testPredictors<-dataframeReduce(testPredictors, 
                                fracmiss=0.2, maxlevels=20,  minprev=0.075)
testPredictors<-testPredictors[,names(trainPredictors)]

Solution

If your goal is to have the same variables with the same levels, then you need to avoid using dataframeReduce a second time, and instead use the same columns as produced by the dataframeReduce operation on hte train-set and apply factor reduction logic to the test-set in a manner that results in whatever degree of homology is needed of subsequent comparison operations. If it is a predict operation that is planned then you need to get the levels to be the same and you need to modify the code in dataframeReduce that works on the levels:

    if (is.category(x) || length(unique(x)) == 2) {
        tab <- table(x)
        if ((min(tab)/n) < minprev) {
            if (is.category(x)) {
              x <- combine.levels(x, minlev = minprev)
              s <- "grouped categories"
              if (length(levels(x)) < 2) 
                s <- paste("prevalence<", minprev, sep = "")
            }
            else s <- paste("prevalence<", minprev, sep = "")
        }
    }

So a better problem statement is likely to produce a better strategy. This will probably require both knowing what levels are in the entire set and in the train and test sets as well as what testing or predictions are anticipated (but not yet stated).

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow