Question

When I add regularization techniques such as L1 or L2 to my model, do I need more epochs for the model to converge properly?

from sklearn.linear_model import SGDClassifier

# X_train, Y_train, X_test, Y_test are assumed to be defined elsewhere.
# Note: on scikit-learn >= 1.1 use loss="log_loss" instead of loss="log".
for r in (None, "l1", "l2"):
    for max_iter in (30, 45, 60):
        classifier = SGDClassifier(loss="log", penalty=r, max_iter=max_iter,
                                   learning_rate="constant", eta0=0.01, random_state=42)
        print("max_iter={}".format(max_iter))
        classifier.fit(X_train, Y_train)
        acc = classifier.score(X_test, Y_test)
        print("accuracy when r={} is {}".format(r, acc * 100))
  1. When r = None:
  • max_iter = 30 or 45: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  • max_iter = 60: no warning.
  2. When r = L1:
  • max_iter = 30: same warning.
  • max_iter = 45 or 60: no warning.
  3. When r = L2:
  • max_iter = 30, 45, or 60: same warning.

Does this matter, or is it random?


Solution

The convergence time is sensitive to your data and to the random seed. Specifically, the convergence time is linear in expectation in all three cases. SGDClassifier uses stochastic gradient descent for optimization. Since the L1 penalty is only subdifferentiable (it is not differentiable at zero), it typically makes the algorithm converge noticeably more slowly.
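One way to check this on your own data, rather than relying on the warning alone, is to fit with a generous max_iter and inspect the fitted estimator's n_iter_ attribute, which reports how many epochs SGDClassifier actually ran before its stopping criterion was met. A minimal sketch; the make_classification data is only a stand-in for your X_train/Y_train:

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Stand-in data; replace with your own X_train / Y_train.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

for r in (None, "l1", "l2"):
    # loss="log_loss" requires scikit-learn >= 1.1; use loss="log" on older versions.
    clf = SGDClassifier(loss="log_loss", penalty=r, max_iter=1000,
                        learning_rate="constant", eta0=0.01,
                        tol=1e-3, random_state=42)
    clf.fit(X, y)
    # n_iter_ is the number of epochs actually run before the stopping
    # criterion (no sufficient improvement for n_iter_no_change epochs) was met.
    print("penalty={!r}: stopped after {} epochs".format(r, clf.n_iter_))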

Comparing with and without the L2 penalty, it is less clear which converges faster. The loss function is differentiable in both cases. The L2 penalty may converge faster in the underdetermined (ill-conditioned) case. In the example below, I consider full gradient descent instead of stochastic gradient descent, and plain linear regression instead of classification, to simplify the argument. Say we aim to solve y = Xb + e, where we observe only y and X. We set the loss function to f(b) = 0.5 ||y - Xb||^2. Without regularization, the solution is sol1 = (X^T X)^{-1} X^T y, and with L2 regularization it is sol2 = (X^T X + lambda I)^{-1} X^T y. In the latter case, the matrix to invert is guaranteed not to be close to singular: adding lambda I bounds its smallest eigenvalue away from zero, which improves the condition number that governs the convergence rate of gradient descent, so faster convergence is expected.
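As a rough illustration of this point (not the original code from the question), here is a small NumPy sketch that runs plain gradient descent on an ill-conditioned least-squares problem, with and without an L2 term, and counts iterations until the gradient norm drops below a tolerance; the synthetic data and the lambda value are arbitrary choices:

import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 20
X = rng.normal(size=(n, p))
X[:, -1] = X[:, 0] + 1e-3 * rng.normal(size=n)   # nearly collinear columns -> ill-conditioned
b_true = rng.normal(size=p)
y = X @ b_true + 0.1 * rng.normal(size=n)

def gd_iterations(lam, tol=1e-6, max_iter=200000):
    """Gradient descent on f(b) = 0.5*||y - Xb||^2 + 0.5*lam*||b||^2."""
    A = X.T @ X + lam * np.eye(p)
    step = 1.0 / np.linalg.eigvalsh(A).max()      # safe constant step size
    b = np.zeros(p)
    for i in range(max_iter):
        grad = A @ b - X.T @ y                    # gradient of the penalized loss
        if np.linalg.norm(grad) < tol:
            return i
        b -= step * grad
    return max_iter                               # hit the cap without converging

print("no penalty :", gd_iterations(lam=0.0), "iterations")
print("L2 penalty :", gd_iterations(lam=1.0), "iterations")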

In short, I would expect the required number of iterations, from smallest to largest on average, to be:

  1. L2 penalty
  2. No penalty (potentially, with a close tie with L2 penalty)
  3. L1 penalty

You observe the opposite order; that is most likely specific to your data or random seed.

OTHER TIPS

A penalty adds an extra term to the loss function; in your case, this made your model require more iterations to converge. When a penalty is added, if you look only at the converged value of your original (log) loss, it can be lower than in the no-penalty case. That is the advantage: adding a penalty can make your model converge to a lower loss.
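If you want to verify this claim on your own data, you can compute the data term of the loss (the plain log loss, excluding the penalty term) for each fitted model. A small sketch, reusing the X_train/Y_train/X_test/Y_test from the question:

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

for r in (None, "l1", "l2"):
    # loss="log_loss" requires scikit-learn >= 1.1; use loss="log" on older versions.
    clf = SGDClassifier(loss="log_loss", penalty=r, max_iter=1000,
                        learning_rate="constant", eta0=0.01, random_state=42)
    clf.fit(X_train, Y_train)
    # log_loss measures only the data-fit term, without the penalty term,
    # so the values are directly comparable across penalties.
    train_ll = log_loss(Y_train, clf.predict_proba(X_train))
    test_ll = log_loss(Y_test, clf.predict_proba(X_test))
    print("penalty={!r}: train log loss={:.4f}, test log loss={:.4f}".format(r, train_ll, test_ll))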

Licensed under: CC-BY-SA with attribution