Question

Introduction

I understand the data leakage that preprocessing can cause when our training and test sets are just samples from an unknown population. The preprocessing parameters should be calculated from the training set only and then applied unchanged to the validation/test set, since that is how we would handle any other sample from the unknown population (in production, for example).
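To make the usual procedure concrete, here is a minimal sketch of what I mean, using only the Python standard library. The helper names (`fit_standardizer`, `apply_standardizer`) and the toy numbers are made up for illustration; standardization is just one example of a preprocessing step.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Estimate the scaling parameters (mean, std) from the given sample only."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for a constant feature
    return mu, sigma

def apply_standardizer(values, params):
    """Apply previously fitted parameters without re-estimating them."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Toy data (hypothetical): one numeric feature.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 11.0]

# Parameters come from the training set only...
params = fit_standardizer(train)

# ...and the same parameters are reused for the test set.
train_scaled = apply_standardizer(train, params)
test_scaled = apply_standardizer(test, params)
```

The test values never influence `params`, which is the point of the procedure.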

Question

What about the situation where we have the whole population at hand? Could we calculate the preprocessing parameters (scaling factors, encoding, etc.) from the entire population?

Extra Context

We have the whole population, and the modeling process depends on user input: the user input defines the training set, and the trained model is then used to classify the population.
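A rough sketch of the two options I am comparing, again with only the standard library. The population values, the `user_selected` indices, and the helper names are hypothetical placeholders for the real data and the real user input; the sketch only shows that the two choices give different parameters, not which one is preferable.

```python
from statistics import mean, pstdev

# Hypothetical finite population (one numeric feature), fully known in advance.
population = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

# Hypothetical user input: indices of the rows that form the training set.
user_selected = [0, 1, 2, 3]
train = [population[i] for i in user_selected]

def scaling_params(values):
    """Return (mean, std) computed from the given values."""
    return mean(values), pstdev(values)

def scale(values, params):
    """Standardize values with the supplied (mean, std)."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Option A: parameters computed from the entire population.
pop_params = scaling_params(population)
scaled_with_pop = scale(population, pop_params)

# Option B: parameters computed from the user-defined training set only.
train_params = scaling_params(train)
scaled_with_train = scale(population, train_params)
```

With option A the scaled population is centered at zero by construction; with option B the result depends on which rows the user happened to select.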


Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange