Problem

I am a newbie to this stuff, so I am sorry if my question is stupid.

I need help understanding what data leakage between X_train and X_test is and when exactly it happens. I am currently working on a dataset where I use a KNN imputer to fill in the missing values. I need to scale the data for the KNN imputation, and I do the train-test split and apply machine learning models after the imputation step. I have read that data leakage can happen during scaling, so we should scale after splitting: fit_transform the train set and only transform the test set. I am unsure how that applies in my case, since I am scaling the data in order to impute the missing values, and I only reach the train-test split later. Should I be worried about data leakage with the way I am doing things?

Here is the code:

[code posted as a screenshot; not reproduced here]

Although here I am doing the splitting and applying the decision tree algorithm right after imputation, I still have other steps such as feature selection, so I won't reach the train-test split and decision tree part until much later.


Solution

  1. Split the data into train and test sets.
  2. Fit your imputer on the training set only (call just fit).
  3. Use the fitted imputer to fill the missing values in the training set (call transform).
  4. Train your decision tree on the training set.

Now you are done with the training step. Testing proceeds as follows:

  1. Use the imputer fitted in step 2 (its transform function) to replace the missing values in the test set.
  2. Use the trained decision tree to predict on the test set and evaluate the performance of your model on this unseen data.
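The steps above can be sketched with scikit-learn; this is a minimal illustration, assuming `KNNImputer` and a `DecisionTreeClassifier`, with a small synthetic dataset standing in for the real one:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with ~10% missing values (stand-in for the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan
y = (np.nan_to_num(X[:, 0]) > 0).astype(int)

# 1. Split first, before any fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2.-3. Fit the imputer on the training set only, then transform it
imputer = KNNImputer(n_neighbors=5)
X_train_imp = imputer.fit_transform(X_train)  # fit + transform on train

# 4. Train the decision tree on the imputed training data
tree = DecisionTreeClassifier(random_state=0).fit(X_train_imp, y_train)

# Testing: reuse the already-fitted imputer (transform only, never refit)
X_test_imp = imputer.transform(X_test)
accuracy = tree.score(X_test_imp, y_test)
```

Because the imputer is only ever fitted on X_train, no statistic computed from the test rows can influence the model.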

The data should be split at the very beginning. As the word "unseen" suggests, we must not use any information from the test set while training the model; otherwise, we have data leakage.
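The same fit-on-train-only rule applies to the scaling step the question asks about. A minimal sketch with `StandardScaler` (the numbers are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0]])  # deliberately outside the training range

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from train only
X_test_scaled = scaler.transform(X_test)        # no refit: test stays "unseen"

# The scaler's mean/std reflect only the training data:
# scaler.mean_ is [2.5], regardless of what is in X_test
```

If you instead called fit_transform on the full dataset before splitting, the test value 10.0 would shift the mean and standard deviation used to scale the training data, which is exactly the leakage being described.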

Other tips

Data leakage occurs when your training set contains information that it is not supposed to have; a typical example is having information from the test set in the training set. Generally speaking, data leakage can produce a model that looks "too good to be true" but cannot be trusted, because it was built using leaked information.
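One practical way to keep every preprocessing step leak-free, assuming scikit-learn, is to wrap scaling, imputation, and the model in a `Pipeline`; cross-validation then refits the whole chain on each fold's training portion automatically (data here is again synthetic):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = (np.nan_to_num(X[:, 1]) > 0).astype(int)

# Scale -> impute -> classify; refit from scratch inside every CV fold
pipe = Pipeline([
    ("scale", StandardScaler()),   # NaNs are disregarded when computing mean/std
    ("impute", KNNImputer(n_neighbors=5)),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

scores = cross_val_score(pipe, X, y, cv=5)
```

This addresses the questioner's situation directly: scaling can happen before imputation, as long as both live inside the pipeline so their statistics are always computed from the training portion only.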

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange