Can preprocessing the whole population cause data leakage?
31-10-2019
Question
Introduction
I understand the data leakage problem that preprocessing can cause when the training and test sets are just samples from an unknown population. The preprocessing parameters should be computed from the training set only and then applied unchanged to the validation/test set, since that is how any other sample from the unknown population would be handled (in production, for example).
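To make the procedure above concrete, here is a minimal sketch, assuming standardization (subtract the mean, divide by the standard deviation) as the preprocessing step. The data and the helper names `fit_standardizer` and `apply_standardizer` are hypothetical and only illustrate the idea.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Estimate scaling parameters (mean, standard deviation) from a sample."""
    mu = mean(values)
    sigma = pstdev(values)
    # Guard against a constant feature, which would give a zero divisor.
    return mu, (sigma if sigma > 0 else 1.0)

def apply_standardizer(values, params):
    """Scale values with parameters that were estimated elsewhere."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Hypothetical samples drawn from an unknown population.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

# Parameters come from the training set only...
params = fit_standardizer(train)

# ...and the same parameters are reused for every other sample.
train_scaled = apply_standardizer(train, params)
test_scaled = apply_standardizer(test, params)
```

The test set never influences `params`, so nothing about its distribution reaches the model during training.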
Question
What about the situation where we have the whole population at hand? Could we calculate the preprocessing parameters (scaling factors, encoding, etc.) from the entire population?
Extra Context
We have the whole population, and the modeling process depends on user input. The user's input defines the training set, and the trained model is then used to classify the population.
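The two options being compared can be sketched as follows, again assuming standardization as the preprocessing step. The population values, the `user_selected` indices, and the helper `scaling_params` are all hypothetical stand-ins for the real data and user input.

```python
from statistics import mean, pstdev

# Hypothetical: the entire population is available up front.
population = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

# Hypothetical user input: indices of the records used for training.
user_selected = [0, 1, 2, 3]
train = [population[i] for i in user_selected]

def scaling_params(values):
    """Return (mean, population standard deviation) of the given values."""
    return mean(values), pstdev(values)

# Option A: parameters computed from the whole population.
params_population = scaling_params(population)

# Option B: parameters computed from the user-defined training set only.
params_train = scaling_params(train)
```

The two sets of parameters differ whenever the user's selection is not representative of the population, which is exactly the case the question is asking about.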
No correct solution