Question

I use a classification model on time-series data, and I currently normalize the data before splitting it into train and test sets. I know that train and test data should be treated separately to prevent data leakage. What is the proper order of the normalization steps here? Should I apply steps 1, 2, and 3 separately to train and test after splitting the data with a sliding window? I use the sliding window to compare each hour (test) with the previous 24 hours of data (train). Here is the order I currently use in the pipeline:

  1. Moving averages (mean)
  2. Resampling every hour
  3. Standardization
  4. Split data into train and test using a sliding window (24 hrs of train followed by 1 hr of test, sliding forward 1 hr at a time)
  5. Fit the model using train data
  6. Predict using the test data

Solution

Assuming I understand it correctly, I think your process is fine as it is, except possibly for step 3, "standardization":

  • Steps 1 and 2 are fine, since they cannot leak information from the test set into the training set (assuming the moving average is trailing, i.e. it uses only past values).
  • If the standardization step computes statistics (e.g. mean and standard deviation) over the whole dataset, including the test set, and then standardizes everything with those values, that is a leak. I'd suggest moving step 3 after the split, so that the statistics are computed on the training set only and then applied as is to the test set.
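
To make the suggested order concrete, here is a minimal sketch of standardizing inside each sliding window, using only NumPy. The helper names (`sliding_windows`, `standardize_with_train_stats`) and the random feature matrix `X` are assumptions made for illustration; they are not from the question. `X` stands in for the data after steps 1 and 2 (smoothed and resampled to one row per hour).

```python
import numpy as np

def sliding_windows(n_samples, train_len=24, test_len=1, step=1):
    """Yield (train_idx, test_idx) pairs for a rolling window.

    Hypothetical helper: 24 hourly rows for training, the next row for
    testing, then slide forward by `step` rows.
    """
    start = 0
    while start + train_len + test_len <= n_samples:
        train_idx = np.arange(start, start + train_len)
        test_idx = np.arange(start + train_len, start + train_len + test_len)
        yield train_idx, test_idx
        start += step

def standardize_with_train_stats(X_train, X_test, eps=1e-12):
    """Standardize both sets using mean and s.d. from the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Guard against constant features, which would give a zero s.d.
    std = np.where(std < eps, 1.0, std)
    return (X_train - mean) / std, (X_test - mean) / std

# Made-up hourly data: 48 hours, 2 features (stand-in for steps 1 and 2).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(48, 2))

n_windows = 0
for train_idx, test_idx in sliding_windows(len(X)):
    X_train_s, X_test_s = standardize_with_train_stats(X[train_idx], X[test_idx])
    n_windows += 1
    # Steps 5 and 6 would go here: fit on X_train_s, predict on X_test_s.
```

The point of the design is that the test hour never contributes to the mean or standard deviation, so the statistics are recomputed for every window from its 24 training hours alone.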
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange