Question

I am currently working on a multi-class classification problem with a large training set. However, it has some specific characteristics, which led me to experiment with it, resulting in a few versions of the training set (produced by re-sampling, removing observations, etc.).

I want to perform pre-processing of the data, that is, to scale, center, and impute values (only a small amount of imputation is needed). This is the point where I started to get confused.

I've been taught that you should always pre-process the test set in the same way you've pre-processed the training set; that is (for scaling and centering), to measure the mean and standard deviation on the training set and apply those values to the test set. This seems reasonable to me.
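To make that concrete, here is a minimal numpy sketch of the rule (the variable names and the synthetic data are mine, just for illustration): the mean and standard deviation are estimated on the training set only, then reused verbatim on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Estimate mean and standard deviation on the training set only...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# ...and apply the SAME transformation to both sets.
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma

# The training set is now exactly centered and unit-scaled;
# the test set is only approximately so, which is expected.
```

In caret this is what `preProcess` does when its result is applied to new data with `predict`.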

But what should one do when the training set has been shrunk or resampled? Should one focus on the characteristics of the data that actually feeds the model (which is what the `train` function in R's caret package would suggest, since you can pass the pre-processing object into it directly) and apply those to the test set? Or should one instead capture the real characteristics of the data (from the whole, untouched training set) and apply those? If the second option is better, would it perhaps even be worth merging the training and test data together, just for the pre-processing step, to get estimates that are as accurate as possible (I've never actually heard of anyone doing that, though)?

I know I can simply test some of the approaches described here, and I surely will, but are there any suggestions, based on theory or on your intuition/experience, for how to tackle this problem?

I also have one additional, optional question. Does it ever make sense to center but NOT scale the data (or the other way around)? Can anyone give an example where that approach would be reasonable?

Thank you very much in advance.


Solution

I thought about it this way: the training and test sets are both a sample of the unknown population. We assume that the training set is representative of the population we're studying. That is, whatever transformations we make to the training set are what we would make to the overall population. In addition, whatever subset of the training data we use, we assume that this subset represents the training set, which represents the population.
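One quick sanity check for the "subset represents the training set" assumption is to compare summary statistics of the subset against the full training set before committing to it. A small numpy sketch (synthetic data and names are my own, not from the answer):

```python
import numpy as np

rng = np.random.default_rng(0)
# Full training set drawn from a population with mean 3.0, std 1.5.
X_full = rng.normal(loc=3.0, scale=1.5, size=(10_000, 2))

# A uniformly random subsample should preserve the training-set statistics.
idx = rng.choice(len(X_full), size=1_000, replace=False)
X_sub = X_full[idx]

# If the subset is representative, these should be close to each other.
full_mean, sub_mean = X_full.mean(axis=0), X_sub.mean(axis=0)
full_std, sub_std = X_full.std(axis=0), X_sub.std(axis=0)
```

If the subsampling is deliberately non-uniform (e.g. downsampling one class), the statistics will drift apart, and that drift is exactly the judgment call the answer describes.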

So in response to your first question, it's fine to use that shrunken/resampled training set as long as you feel it's still representative of the population. That's assuming your untouched training set captured the "real characteristics" in the first place :)

As for your second question, don't merge the training and test sets. The test set is there to act as future unknown observations. If you build these into the model, then you won't know whether the model is wrong or not, because you've used up the data you were going to test it with.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange