Question

So, I have not been able to find any literature on this subject, but it seems like something worth thinking about:

  • What are the best practices in model training and optimization if new observations are available?

  • Is there any way to determine the period/frequency of re-training a model before the predictions begin to degrade?

  • Is it over-fitting if the parameters are re-optimised for the aggregated data?

Note that the learning may not necessarily be online. One may wish to upgrade an existing model after observing significant variance in more recent predictions.

Solution

  1. Once a model is trained and new data becomes available, you can load the previous model and continue training it on the new observations. For example, you can save your model as a .pickle file, then load it and train it further when new data arrives. Note that for the model to keep predicting correctly, the new training data should come from a similar distribution as the past data.
  2. How quickly predictions degrade depends on the dataset you are using. For example, suppose you train on Twitter data collected about a product that is widely tweeted about on a particular day; if you then apply the model to tweets from days when that product is no longer discussed, the predictions may be biased. The retraining frequency is therefore dataset-dependent, and there is no universal interval to state as such. If you observe that your incoming data is deviating substantially from the training data, it is good practice to retrain the model.
  3. Re-optimising parameters on the aggregated data is not overfitting; more data does not by itself imply overfitting. Use cross-validation to check for overfitting.
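Point 1 can be sketched in a few lines. This is a minimal illustration, not a specific library's API: the `LinearModel` class and its gradient-descent `train` step are stand-ins for whatever model you actually use; the key part is the pickle round-trip followed by further training on new data drawn from the same distribution.

```python
import pickle
import numpy as np

class LinearModel:
    """Tiny linear regressor trained by gradient descent (illustrative only)."""
    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.lr = lr

    def train(self, X, y, epochs=100):
        for _ in range(epochs):
            grad = X.T @ (X @ self.w - y) / len(y)  # gradient of mean squared error
            self.w -= self.lr * grad

    def predict(self, X):
        return X @ self.w

rng = np.random.default_rng(0)
X_old = rng.normal(size=(100, 2))
y_old = X_old @ np.array([2.0, -1.0])     # true weights are [2, -1]

model = LinearModel(n_features=2)
model.train(X_old, y_old)

# Save the trained model as a .pickle file ...
with open("model.pickle", "wb") as f:
    pickle.dump(model, f)

# ... and later, when new observations arrive, load it and train further.
with open("model.pickle", "rb") as f:
    model2 = pickle.load(f)

X_new = rng.normal(size=(50, 2))          # assumed to follow the same distribution
y_new = X_new @ np.array([2.0, -1.0])
model2.train(X_new, y_new)

print(np.round(model2.w, 2))              # weights remain close to the true [2, -1]
```

Because the new data follows the same distribution, the reloaded model keeps its fit; if the new data's distribution had shifted, the extra training would pull the weights away from the old solution instead.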

OTHER TIPS

When new observations are available, there are three ways to retrain your model:

  1. Online: each time a new observation is available, you use this single data point to further train your model (e.g. load your current model and further train it by doing backpropagation with that single observation). With this method, your model learns in a sequential manner and sort of adapts locally to your data in that it will be more influenced by the recent observations than by older observations. This might be useful in situations where your model needs to dynamically adapt to new patterns in data. It is also useful when you are dealing with extremely large data sets for which training on all of it at once is impossible.
  2. Offline: you add the new observations to your already existing data set and entirely retrain your model on this new, bigger data set. This generally leads to a better global approximation of the target function and is very popular if you have a fixed data set, or if you don't receive new observations too often. However, it is impractical for large data sets.
  3. Batch/mini-batch: this is a middle-ground approach. With batch training, you wait until you have a batch of $n$ new observations and then train your already existing model on this whole batch. It is not offline, because you are not adding the batch to your pre-existing data set and retraining from scratch; and it is not online, because you are training your model on $n$ observations at once rather than just a single one. So it's a bit of both :) Mini-batch is exactly the same except that the batch size is smaller, so it tends towards online learning. In fact, online learning is just batch learning with batch size 1, and offline learning is batch learning with the batch size equal to the whole data set.

Most models today use batch/mini-batch training, and the choice of batch size depends on your application and model. Choosing the right batch size is equivalent to choosing the right frequency with which to retrain your model. If your new observations have low variance relative to your existing data, larger batches (perhaps 256-512) are reasonable; if, on the contrary, new observations tend to vary greatly from your existing data, use smaller batches (8-256). At the end of the day, batch size is another hyper-parameter that you need to tune and that is specific to your data.
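The three regimes above differ only in how many observations you buffer before triggering an update, so they can all be expressed with one loop. A minimal sketch, where `update_model` stands in for whatever training step your framework provides:

```python
def retrain_on_stream(stream, update_model, batch_size):
    """Buffer observations and run a training step every `batch_size` points.

    batch_size = 1              -> online learning
    1 < batch_size < data size  -> (mini-)batch learning
    batch_size = data size      -> effectively offline retraining
    """
    buffer = []
    for observation in stream:
        buffer.append(observation)
        if len(buffer) == batch_size:
            update_model(buffer)   # one update step on the full batch
            buffer = []
    if buffer:                     # flush any leftover partial batch
        update_model(buffer)

# Usage: count how many update calls a stream of 10 observations triggers.
calls = []
retrain_on_stream(range(10), lambda batch: calls.append(len(batch)), batch_size=4)
print(calls)  # [4, 4, 2]
```

Tuning `batch_size` here is exactly the "retraining frequency" decision: a smaller batch means more frequent, more local updates; a larger batch means rarer, more global ones.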

Your problem comes under the umbrella of online learning methods. Assuming a stream of incoming data, you can use stochastic gradient descent (SGD) to update your model parameters using each single example.

If your cost function is :

$ \min_\theta J(x,y,\theta) $ ,

where $\theta$ is the parameter vector, then given streaming data of the form $(x^{i}, y^{i})$ you can update your parameter vector using SGD with the following update equation:

$ \theta^{t} = \theta^{t-1} - \eta \, \nabla_\theta J(x^{i}, y^{i}) $,

where $\eta$ is the learning rate. This is essentially SGD with batch size 1.

There is one other trick: you can adopt a window/buffer-based method, where you buffer $n$ examples from the stream, treat them as a batch, and apply a batched update. In that case the update equation becomes:

$ \theta^{t} = \theta^{t-1} - \eta \sum_{i} \nabla_\theta J(x^{i}, y^{i}) $,

where the sum runs over the $n$ buffered examples (with learning rate $\eta$). This is essentially mini-batch SGD.
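Both update rules can be written directly in numpy. The squared-error cost and the learning rate value below are illustrative choices, not part of the original derivation; the sum in `minibatch_step` matches the equation above (in practice one often averages over the batch instead):

```python
import numpy as np

def grad_J(theta, x_i, y_i):
    """Gradient of the squared-error cost J = (theta . x - y)^2 / 2 (illustrative)."""
    return (x_i @ theta - y_i) * x_i

def sgd_step(theta, x_i, y_i, eta=0.1):
    """Online update: theta^t = theta^{t-1} - eta * grad J(x^i, y^i)."""
    return theta - eta * grad_J(theta, x_i, y_i)

def minibatch_step(theta, batch, eta=0.01):
    """Buffered update: theta^t = theta^{t-1} - eta * sum_i grad J(x^i, y^i)."""
    return theta - eta * sum(grad_J(theta, x, y) for x, y in batch)

rng = np.random.default_rng(1)
true_theta = np.array([1.5, -0.5])
theta = np.zeros(2)

# Simulated stream, consumed one example at a time (online SGD, batch size 1).
for _ in range(500):
    x = rng.normal(size=2)
    theta = sgd_step(theta, x, x @ true_theta)

# Buffered variant: collect 10 examples, then apply a single batched update.
batch = []
for _ in range(10):
    x = rng.normal(size=2)
    batch.append((x, x @ true_theta))
theta = minibatch_step(theta, batch)

print(np.round(theta, 2))  # converges towards the true parameters [1.5, -0.5]
```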

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange