Question

So this is the formula for the regularized logistic regression cost function:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

$x^{(i)}$ - the $i$'th training example

$\theta_j$ - the parameter of the $j$'th feature

$m$ - the number of training examples

$n$ - the number of features

$y^{(i)}$ - the actual outcome of the $i$'th training example, such that $y^{(i)} \in \{0, 1\}$

$\lambda$ - the regularization parameter

$h_\theta$ - the hypothesis function that produces a prediction in the interval $(0, 1)$

$h_\theta(x^{(i)})$ - predicted value of the $i$'th training example, such that:

$h_\theta(x^{(i)}) = \sigma(\theta^{T}x^{(i)})$, where $\sigma(z) = \frac{1}{1+e^{-z}}$ (the sigmoid/logistic function)
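To make sure I have the formula right, here is a minimal NumPy sketch of how I understand it (the names `sigmoid`, `regularized_cost`, `X`, `y` and `lam` are mine, and I am assuming the usual convention that the bias parameter $\theta_0$ is left out of the regularization sum, since $j$ runs from $1$ to $n$):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Regularized logistic regression cost J(theta).

    X     : (m, n+1) design matrix whose first column is all ones (bias)
    y     : (m,) vector of labels in {0, 1}
    theta : (n+1,) parameter vector; theta[0] is the bias, not regularized
    lam   : regularization parameter lambda
    """
    m = len(y)
    h = sigmoid(X @ theta)  # h_theta(x^(i)) for every training example
    # Average cross-entropy error over the m training examples (left term)
    error = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    # Regularization term: sum of squared parameters, scaled by lambda / (2m)
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return error + reg
```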

My question is about that last term:

$$\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

From my understanding, in order to do regularization we need to find the sum of the squares of all the parameters $\theta_j$. That is clear enough. We also multiply this sum by the regularization parameter $\lambda$, which we can adjust if we think the model is overfitting/underfitting. Good. Then, for convenience, we divide this term by $2$ so that when we take the derivative, the $2$ coming from the exponent of $\theta_j$ cancels out. All clear so far.

BUT why in the world do we also divide by $m$ (the number of training examples)? We divide the left term by $m$ in order to find the average error, and that makes sense: we have $m$ examples, therefore $m$ errors, and after summing these $m$ errors we need to divide by $m$ to get the average. But in the right term that I am confused about, we sum the squares of the parameters, and there is one parameter per feature, i.e. $n$ of them. If we wanted the average of all the $\theta_j^2$, wouldn't we need to divide by $n$ instead of $m$, since we have $n$ features? Why does it make sense to divide that sum by $m$ rather than by $n$?
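To spell out the cancellation I mentioned above, the derivative of that term with respect to a single $\theta_j$ is:

$$\frac{\partial}{\partial \theta_j}\left(\frac{\lambda}{2m}\sum_{k=1}^{n}\theta_k^2\right) = \frac{\lambda}{2m}\cdot 2\theta_j = \frac{\lambda}{m}\theta_j$$

so the $\frac{1}{2}$ disappears, but the $\frac{1}{m}$ stays, which is exactly the factor I am asking about.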
