Question

I have seen it claimed in different places that L1 regularization penalizes weights more than L2.

But the derivative of the L1 penalty is $\lambda$ (times the sign of the weight), while the derivative of the L2 penalty is $2\lambda w$. So L1 regularization subtracts a smaller value than L2. Why, then, is it said that L1 penalizes weights more than L2? Or is it incorrect to say it like this?


Solution

That is generally not true. To be more accurate, we can say that L1 promotes sparsity: if a weight is larger than 1 then L2 cares more about it than L1, while if a weight is smaller than 1 then L1 cares more about it than L2.

For a quick example, imagine two weights, $w_1 = 15$ and $w_2 = 0.02$, and suppose the model considers reducing each of them by a small amount $\epsilon=0.001$ (for simplicity, assume that reducing either weight by $\epsilon$ increases the model-error part of the loss at the same rate). Now let's calculate how much the regularization part of the loss changes for each weight under the two penalties.

For $w_1$ and L1: $\Delta L=|15|-|14.999| = 0.001 $ (Loss decreased by 0.001)

For $w_1$ and L2: $\Delta L=|15|^2-|14.999|^2 = 0.029999 $ (Loss decreased by 0.029999)

We can see that for $w_1$ the loss decreases about 30 times more for L2 than for L1. So L2 regularizes this weight more (i.e. L2 is willing to sacrifice more model complexity to shrink this weight).

For $w_2$ and L1: $\Delta L=|0.02|-|0.019| = 0.001 $ (Loss decreased by 0.001)

For $w_2$ and L2: $\Delta L=|0.02|^2-|0.019|^2 = 0.000039 $ (Loss decreased by 0.000039)

We can see that for $w_2$ the loss decreases about 26 times more for L1! So for small weights L1 is willing to sacrifice more model expressiveness just to reduce them, while L2 barely pays any attention (compared to L1) to weights close to 0.
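A quick Python snippet reproduces these numbers for both weights:

```python
# Change in the penalty term when each weight is reduced by epsilon,
# under the L1 (|w|) and L2 (w^2) penalties used above.
eps = 0.001
for w in (15.0, 0.02):
    d_l1 = abs(w) - abs(w - eps)   # decrease of the L1 penalty
    d_l2 = w**2 - (w - eps)**2     # decrease of the L2 penalty
    print(f"w={w}: L1 drop={d_l1:.6f}, L2 drop={d_l2:.6f}")
```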

It's possible that in practice you perceive L1 as regularizing weights more than L2 because when you use L1 regularization you notice that many weights end up exactly 0, whereas with L2 almost no weights are exactly 0, so you might conclude that L1 is "stronger". But that just comes from the point noticed above: L2 barely pays any attention to weights close to 0, while L1 still sees a benefit of $\epsilon$ no matter how small the weight was before the change, and thus it promotes sparsity.
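To see that mechanism in isolation, here is a minimal sketch that applies only the penalty updates (no data term at all; the values of `lam`, `lr` and `steps` are arbitrary choices), using a soft-thresholding (proximal) step for L1 and a plain gradient step for L2:

```python
import numpy as np

# Shrink a weight vector using only the regularization term:
# L1 via soft-thresholding reaches exactly zero for small weights,
# L2 via gradient steps only shrinks weights toward zero.
w_l1 = np.array([15.0, 0.02, -0.5])
w_l2 = w_l1.copy()
lam, lr, steps = 0.1, 0.1, 200   # arbitrary illustrative settings

for _ in range(steps):
    # L1 proximal step: w <- sign(w) * max(|w| - lr*lam, 0)
    w_l1 = np.sign(w_l1) * np.maximum(np.abs(w_l1) - lr * lam, 0.0)
    # L2 gradient step on lam * w^2: w <- w - lr * 2 * lam * w
    w_l2 = w_l2 - lr * 2 * lam * w_l2

print("L1:", w_l1)   # the small entries hit exactly 0
print("L2:", w_l2)   # entries shrink toward 0 but never reach it
```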

There are many practical and theoretical differences between L1 and L2 regularization, too many to list here. For example, one practical difference is that L1 can act as a form of feature elimination in linear regression. A theoretical difference is that L2 regularization corresponds to the MAP estimate under a Gaussian (Normal) prior, while L1 corresponds to a Laplace prior.
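For the feature-elimination point, a minimal sketch on synthetic data (assuming scikit-learn is available; the data and the `alpha` values are arbitrary) shows Lasso (L1) zeroing out irrelevant coefficients while Ridge (L2) only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem where only the first two features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ true_w + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print("Lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0)))
print("Ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)))
```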

EDIT: I just reread your post, and yes, the derivatives give the same insight. The L1 gradient has constant magnitude $\lambda$, while the L2 gradient has magnitude $2\lambda|w|$: for $|w| > \tfrac{1}{2}$ we have $2\lambda|w| > \lambda$, so L2 pushes large weights down harder, while for $|w| < \tfrac{1}{2}$ we have $2\lambda|w| < \lambda$, so L1 pushes small weights down harder.
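A few lines make that comparison of gradient magnitudes concrete (taking $\lambda = 1$):

```python
# Gradient magnitude of each penalty term with lambda = 1:
# L1 pulls with constant strength lam, L2 pulls with strength 2*lam*|w|,
# so L2 dominates once |w| > 0.5 and L1 dominates below that.
lam = 1.0
for w in (0.1, 0.5, 1.0, 15.0):
    print(f"|w|={w}: L1 grad={lam}, L2 grad={2 * lam * w}")
```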

Licensed under: CC-BY-SA with attribution