Question

Sparse methods such as LASSO contain a parameter $\lambda$ associated with minimizing the $l_1$ norm. The higher the value of $\lambda$ ($>0$), the more coefficients are shrunk to zero. What is unclear to me is how this method decides which coefficients to shrink to zero.

  • If $\lambda = 0.5$, does it mean that coefficients whose values are less than or equal to 0.5 will become zero? In other words, whatever the value of $\lambda$, are the coefficients whose values fall within $\lambda$ turned off/set to zero? Or is there some other meaning to the value of $\lambda$?
  • Can $\lambda$ be negative?

Solution

When we fit penalized regression models, we add a penalty to the sum of squared errors.

Recall that the sum of squared errors (SSE) is defined as follows, and that least squares regression tries to minimize this value:

$$SSE = \sum_{i=1}^{n}(y_i-\hat{y_i})^2$$

When the model overfits or there is collinearity present, the coefficient estimates from our least squares model may be larger than they should be.

How do we fix this? We use regularization. That is, we add a penalty to the sum of squared errors, thereby limiting how large the parameter estimates can get.

For Ridge Regression, the penalized objective looks like this:

$$SSE_{L_2} = \sum_{i=1}^{n}(y_i-\hat{y_i})^2 + \lambda \sum_{j=1}^{P}\beta_j^2$$

Notice what is different with this model. Here, we add an L2 regularization penalty to the end of the SSE: $\lambda$ multiplied by the sum of the squared parameter estimates. This limits how large the parameter estimates can get. As you increase the "shrinkage parameter" $\lambda$, the parameter estimates are shrunk more towards zero. What is important to note with Ridge Regression is that it shrinks the values towards zero, but not to zero.
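
To make this concrete, here is a minimal sketch (my addition, not part of the original answer) using scikit-learn's `Ridge`, whose `alpha` argument plays the role of $\lambda$; the synthetic dataset and the $\lambda$ values are arbitrary:

```python
# Sketch: ridge coefficients shrink toward zero as lambda grows, but never
# become exactly zero. (scikit-learn calls the shrinkage parameter `alpha`.)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

for lam in [0.1, 1.0, 10.0, 100.0]:
    coefs = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:6.1f}  coefficients={np.round(coefs, 2)}")
# The estimates get smaller in magnitude as lambda increases, yet all stay nonzero.
```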

You may also use the LASSO Regression technique as shown below:

$$SSE_{L_1} = \sum_{i=1}^{n}(y_i-\hat{y_i})^2 + \lambda \sum_{j=1}^{P} \lvert\beta_j\rvert$$

Notice here that the change from adding the L1 penalty looks very similar. However, the difference is that we are now penalizing the absolute value of the coefficients. This allows shrinkage all the way to zero and can be considered a form of feature selection.
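
Again, a small sketch (my addition, with an arbitrary synthetic dataset) showing the feature-selection effect. Note that scikit-learn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` is not numerically identical to the $\lambda$ above, but the qualitative behavior is the same:

```python
# Sketch: the L1 penalty sets some coefficients to exactly zero, and larger
# lambda values zero out more of them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 3 of the 10 features actually carry signal.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for lam in [0.1, 1.0, 10.0]:
    coefs = Lasso(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:5.1f}  zeroed coefficients: {np.sum(coefs == 0)} of {coefs.size}")
# Which coefficients get zeroed is determined by the optimization as a whole,
# not by simply comparing each coefficient's value to lambda.
```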

Either way, both methods penalize model complexity and yield parsimonious models!

Two things to note:

  1. In answer to your question, no, $\lambda$ cannot be negative. Why? It would make no sense: $\lambda$ multiplies the L2 or L1 norm to add a penalty to the SSE, so a negative $\lambda$ would actually reward model complexity rather than penalize it.

  2. When you have a value of $\lambda = 0$ you have no penalty and just have regular least squares regression (see the sketch below)!
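
As a quick sanity check (again a sketch I'm adding, not from the original answer), fitting ridge with `alpha=0` in scikit-learn reproduces the ordinary least squares coefficients:

```python
# Sketch: with lambda = 0 the penalty term vanishes and penalized regression
# collapses back to ordinary least squares.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

ols_coefs = LinearRegression().fit(X, y).coef_
ridge_coefs = Ridge(alpha=0.0).fit(X, y).coef_

print(np.allclose(ols_coefs, ridge_coefs))  # True: the estimates coincide
```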

Other tips

@Ethan is correct about the formulation of the lasso penalty, and I think it's particularly important to understand it in that form (for one thing, because that same penalty can work with other models like neural networks, tree models, generalized linear models, ...).

But, to your question:

If $\lambda = 0.5$, does it mean that coefficients whose values are less than or equal to 0.5 will become zero? In other words, whatever the value of $\lambda$, are the coefficients whose values fall within $\lambda$ turned off/set to zero? Or is there some other meaning to the value of $\lambda$?

The answer is mostly yes, under some assumptions. For vanilla LASSO (OLS plus the L1-penalty), if the covariates are orthonormal, then the LASSO coefficients can be written in terms of the OLS coefficients:

$$\beta_i^{\text{LASSO}} = \begin{cases} \beta_i^{\text{OLS}} - \lambda & \text{if $\beta_i^{\text{OLS}} \geq \lambda$} \\ \beta_i^{\text{OLS}} + \lambda & \text{if $\beta_i^{\text{OLS}} \leq -\lambda$} \\ 0 & \text{if $|\beta_i^{\text{OLS}}|\leq\lambda$.} \end{cases}$$ (You'll see that written more concisely as $\operatorname{sgn}(\beta_i^{\text{OLS}}) \left(|\beta_i^{\text{OLS}}|-\lambda\right)_+$.)

See e.g.:
https://stats.stackexchange.com/q/17781/232706
http://pages.cs.wisc.edu/~jerryzhu/cs731/regression.pdf (section 3)
https://stats.stackexchange.com/q/342547/232706
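
Here is a tiny numeric illustration of that soft-thresholding rule (the OLS values below are made up, and the orthonormal-design assumption is taken for granted):

```python
# Sketch: apply sgn(b_ols) * max(|b_ols| - lambda, 0) elementwise. Coefficients
# with |b_ols| <= lambda become exactly zero; the rest are pulled toward zero
# by lambda.
import numpy as np

beta_ols = np.array([2.3, -0.4, 0.6, -1.7, 0.1])  # hypothetical OLS estimates
lam = 0.5

beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)
print(beta_lasso)  # [ 1.8 -0.   0.1 -1.2  0. ]
```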

Licensed under: CC-BY-SA with attribution