Question

The Adam optimizer has four main hyperparameters. For example, looking at the Keras interface, we have:

keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)

The first hyperparameter is called the step size or learning rate. In theory, an adaptive optimization method should automatically adjust the learning rate during optimization. Thus, I would expect lr not to be a very influential hyperparameter, one I could remove from the list of hyperparameters I have to tune on the validation set, saving some time.

Does this hold true in practice? I.e., is it true that, at least for a class of problems (say, image recognition), lr doesn't affect the optimization much, and thus we can just leave it at the default value of 0.001? Or is it still extremely influential, as it is for SGD with momentum?
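For concreteness, the kind of validation-set sweep I am hoping to skip might look roughly like the sketch below. The model, the synthetic data and the candidate grid are just placeholders, and note that recent tf.keras releases spell the argument learning_rate rather than lr:

import numpy as np
from tensorflow import keras

def build_model():
    # Toy architecture; stands in for whatever network is actually being tuned.
    return keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        keras.layers.Dense(10, activation="softmax"),
    ])

# Placeholder data; a real train/validation split would go here.
x_train, y_train = np.random.rand(1000, 32), np.random.randint(0, 10, 1000)
x_val, y_val = np.random.rand(200, 32), np.random.randint(0, 10, 200)

best_lr, best_acc = None, -np.inf
for lr in [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]:   # grid around the 0.001 default
    model = build_model()
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5, verbose=0)
    _, acc = model.evaluate(x_val, y_val, verbose=0)
    if acc > best_acc:
        best_lr, best_acc = lr, acc

print("best lr on the validation set:", best_lr)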

EDIT: to clear up a misunderstanding brought up in an answer: learning rate and step size are synonymous. See the algorithm definition in the Adam paper: $\alpha$, whose default value is 0.001, is clearly named the step size. I reckon it's probably not a great name (the actual size of the step in parameter space depends on the accumulated first- and second-moment estimates, as well as on the gradient, of course), but unfortunately such misleading terminology is the norm in optimization (at least in Deep Learning papers).

[Image: the algorithm definition from the Adam paper, where $\alpha$ is labelled the step size]
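For reference, the per-parameter update in that algorithm (lightly paraphrased here, in the paper's notation, with $g_t$ the gradient at step $t$) is

$$
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1-\beta_1^t), \qquad \hat{v}_t = v_t / (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
\end{aligned}
$$

so $\alpha$ only scales the normalized gradient $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$; the adaptation happens inside that ratio, not in $\alpha$ itself.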

No correct solution

Licensed under: CC-BY-SA with attribution