Question

I use a Bayesian hyperparameter (HP) optimization approach (BOHB) to tune a deep learning model. However, the resulting model is not robust when it is retrained repeatedly on the same data. I know I could use a seed to fix the parameter initialization, but I wonder if there are HP optimization approaches that already account for robustness.

To illustrate the problem, let's consider a one-layer neural network with only one HP: the hidden size (h). The model performs well with a small h. With a larger h, the results start to fluctuate more, maybe due to a more complex loss landscape; the random initialization of the parameters can lead to good performance, or to very bad performance if the optimizer gets stuck in a local minimum (which happens more often due to the complex loss landscape). The loss vs h plot could look something like this:

[Plot: loss vs. hidden size h, marking a narrow 'best solution' at larger h and a wider 'robust solution' at smaller h]

I would prefer the 'robust solution', while the 'best solution' is the one selected by the HP optimization algorithm. Are there HP optimization algorithms that account for robustness? Or how would you deal with this problem?

Solution

As I understand them, Bayesian optimization approaches are already somewhat robust to this problem. The evaluated performance function is usually(?) considered noisy, so the search will want to check near the "best solution" $h$ to improve certainty; if it then finds lots of poorly performing models, its surrogate function should start to downplay that point. (See e.g. these two blog posts.)
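
To make that concrete, here is a minimal, self-contained sketch of a Bayesian optimizer that is told its objective is noisy. It uses scikit-optimize's `gp_minimize` as a stand-in (the question uses BOHB, which models the objective differently, but the idea carries over), and the toy `noisy_loss` function is entirely hypothetical: it just mimics "larger h has lower mean loss but higher variance".

```python
# Sketch only: gp_minimize stands in for BOHB, noisy_loss stands in for real training.
import numpy as np
from skopt import gp_minimize
from skopt.space import Integer

rng = np.random.default_rng(0)

def noisy_loss(params):
    h = params[0]
    # Toy stand-in: larger h has a slightly lower mean loss but a much
    # larger spread, mimicking the "best vs. robust solution" picture.
    mean = 1.0 / np.sqrt(h)
    spread = 0.002 * h
    return float(mean + rng.normal(scale=spread))

result = gp_minimize(
    noisy_loss,
    dimensions=[Integer(4, 256, name="hidden_size")],
    n_calls=40,
    noise="gaussian",   # tell the GP surrogate that evaluations are noisy
    random_state=0,
)
print("selected h:", result.x[0], "estimated loss:", result.fun)
```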

If the instability is largely due to random effects (e.g. the weight initializations you mention), then simply repeating the model fit and taking an average (or the worst, or some percentile) of the performances should work well; see the sketch below. If it's really an effect of "neighboring" $h$ values, then you could similarly fit models near the selected $h$ and consider their aggregate performance. Of course, both of these add quite a bit of computational expense; but I think this might be the closest to "the right" solution that doesn't depend on the internals of the optimization algorithm.
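
A minimal, runnable sketch of the "repeat the fit and aggregate" idea: the score handed to the HP optimizer is a pessimistic percentile of the validation loss over several re-initializations, not the loss of a single lucky run. The `MLPRegressor` and the toy regression data are stand-ins for your real model and dataset.

```python
# Sketch only: the model and data are placeholders for the real training setup.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

def robust_objective(h, n_repeats=5):
    """Validation loss aggregated over several random initializations."""
    losses = []
    for seed in range(n_repeats):
        model = MLPRegressor(hidden_layer_sizes=(h,), max_iter=500,
                             random_state=seed)   # fresh initialization each repeat
        model.fit(X_tr, y_tr)
        losses.append(mean_squared_error(y_val, model.predict(X_val)))
    # Pessimistic aggregate: a high percentile penalizes unstable h values.
    return float(np.percentile(losses, 90))

print(robust_objective(h=16))
```

The same wrapper works for any HP search: BOHB (or any other optimizer) only ever sees the aggregated score, so unstable configurations look bad even if one of their runs happened to be excellent.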

Other tips

One option is not to measure the performance of the hyperparameters via the loss function on the training data, but via the evaluation metric on the validation data; a brief sketch follows. The end goal of most machine learning systems is the ability to predict on unseen data. Focusing on the "best solution" as measured by the training loss will lead to overfitting and non-robust solutions.
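
A brief sketch of the distinction: score a hyperparameter configuration on a held-out validation metric, not on the training loss it was fit to minimize. The `MLPRegressor` and the data arguments are hypothetical stand-ins, as in the sketch above.

```python
# Sketch only: report both losses; the HP search should use the validation one.
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

def hp_score(h, X_tr, y_tr, X_val, y_val, seed=0):
    model = MLPRegressor(hidden_layer_sizes=(h,), max_iter=500, random_state=seed)
    model.fit(X_tr, y_tr)
    train_loss = mean_squared_error(y_tr, model.predict(X_tr))   # what training optimizes
    val_loss = mean_squared_error(y_val, model.predict(X_val))   # what the HP search should use
    return train_loss, val_loss
```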

Licensed under: CC-BY-SA with attribution