Time series data and ML - separating training/test data

https://datascience.stackexchange.com/questions/77235

12-12-2020

Question

I am using XGBoost to try to predict the direction of the stock market based on social media sentiment. Having read through some studies, I was planning to separate the training/test data by time period, e.g. use 2014-2016 data for training and 2016-2018 data for testing.

Does that make intuitive sense given the nature of the data I am using?

I am happy to provide any further details which would be helpful, thank you.

Solution

With time-series data, the most recent observations usually carry the most relevant information, so it is prudent to include them in the training data rather than holding them back only for testing. A better option is therefore Roll-Forward Partitioning.

Roll-Forward Partitioning: Start with a short training period and gradually extend it. At each iteration, train the model on the current training period and have it forecast the next interval of data. This requires more training time, but it mimics deployment, where the model would be retrained at regular intervals to keep it up to date.
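
As a rough illustration of the idea, here is a minimal sketch in plain Python that generates the index splits. The function name `roll_forward_splits` and its parameters `initial_train` and `horizon` are hypothetical names chosen for this example, and it assumes the samples are already sorted from oldest to newest.

```python
from typing import Iterator, List, Tuple


def roll_forward_splits(
    n_samples: int, initial_train: int, horizon: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs over time-ordered samples.

    The training window always starts at index 0 and grows by `horizon`
    each iteration; the test window is the `horizon` samples right after it.
    """
    if initial_train <= 0 or horizon <= 0:
        raise ValueError("initial_train and horizon must be positive")
    train_end = initial_train
    while train_end + horizon <= n_samples:
        train_idx = list(range(0, train_end))
        test_idx = list(range(train_end, train_end + horizon))
        yield train_idx, test_idx
        train_end += horizon


# Hypothetical example: 10 time-ordered observations, oldest first.
splits = list(roll_forward_splits(n_samples=10, initial_train=4, horizon=2))
for train_idx, test_idx in splits:
    # Fit the model on train_idx here, then score its forecast on test_idx.
    print(f"train 0..{train_idx[-1]} -> test {test_idx[0]}..{test_idx[-1]}")
```

Every test index comes strictly after every training index, so the model is never evaluated on data older than what it was trained on. If scikit-learn is available, its `TimeSeriesSplit` class in `sklearn.model_selection` produces a similar expanding-window scheme and may be a more convenient choice.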

You can find more about it here, here and here.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange