Trying to understand Logistic Regression Implementation

https://datascience.stackexchange.com/questions/8657

16-10-2019
|

Question

I'm currently using the following code as a starting point to deepen my understanding of regularized logistic regression. As a first pass I'm just trying to do a binary classification on part of the iris data set.

One problem I think I have encountered is that the negative log-loss (computed with loss and stored in loss_vec) doesn't change much from one iteration to the next.

Another challenge I am facing is trying to figure out how to plot the decision boundary once I have learned the logistic regression coefficients. Using the coefficients to plot the 0.5 decision boundary is way off. This makes me think I have made a mistake somewhere

http://fa.bianp.net/blog/2013/numerical-optimizers-for-logistic-regression/

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets


iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target


X = X[:100,:]
Y = Y[:100]

def phi(t):
    # logistic function, returns 1 / (1 + exp(-t))
    idx = t > 0
    out = np.empty(t.size, dtype=np.float)
    out[idx] = 1. / (1 + np.exp(-t[idx]))
    exp_t = np.exp(t[~idx])
    out[~idx] = exp_t / (1. + exp_t)
    return out


def loss(x0, X, y, alpha):
    # logistic loss function, returns Sum{-log(phi(t))}
    #x0 is the weight vector w are the paramaters, c is the bias term
    w, c = x0[:X.shape[1]], x0[-1]
    z = X.dot(w) + c
    yz = y * z
    idx = yz > 0
    out = np.zeros_like(yz)
    out[idx] = np.log(1 + np.exp(-yz[idx]))
    out[~idx] = (-yz[~idx] + np.log(1 + np.exp(yz[~idx])))
    out = out.sum() / X.shape[0] + .5 * alpha * w.dot(w)
    return out


def gradient(x0, X, y, alpha):
    # gradient of the logistic loss
    w, c = x0[:X.shape[1]], x0[-1]
    z = X.dot(w) + c
    z = phi(y * z)
    z0 = (z - 1) * y
    grad_w = X.T.dot(z0) / X.shape[0] + alpha * w
    grad_c = z0.sum() / X.shape[0]
    return np.concatenate((grad_w, [grad_c]))


def bgd(X, y, alpha, max_iter):
    step_sizes = np.array([100,10,1,.1,.01,.001,.0001,.00001])
    iter_no = 0
    x0 = np.random.random(X.shape[1] + 1) #initialize weight vector

    #placeholder for coefficients to test against the loss function
    next_thetas = np.zeros((step_sizes.shape[0],X.shape[1]+1) )   
    J = loss(x0,X,y,alpha)
    running_loss = []
    while iter_no < max_iter:
        grad = gradient(x0,X,y,alpha)
        next_thetas = -(np.outer(step_sizes,grad)-x0)
        loss_vec = []
        for i in range(np.shape(next_thetas)[0]):
            loss_vec.append(loss(next_thetas[i],X,y,alpha))
        ind = np.argmin(loss_vec)
        x0 = next_thetas[ind]
        if iter_no % 500 == 0:
            running_loss.append(loss(x0,X,y,alpha))
        iter_no += 1  
    return next_thetas

Solution

There are several issues I see with the implementation. Some are just unnecessarily complicated ways of doing it, but some are genuine errors.

Primary takeaways

A: Try to start from the math behind the model. The logistic regression is a relatively simple one. Find the two equations you need and stick to them, replicate them letter by letter.

B: Vectorize. It will save you a lot of unnecessary steps and computations, if you step back for a bit and think of the best vectorized implementation.

C: Write more comments in the code. It will help those trying to help you. It will also help you understand each part better and maybe uncover errors yourself.

Now let's go over the code step by step.

1. The sigmoid function

Is there a reason for such a complicate implementation in phi(t)? Assuming that t is a vector (a numpy array), then all you really need is:

def phi(t):
   1. / (1. + np.exp(-t))

As np.exp() operates element-wise over arrays. Ideally, I'd implement it as a function that can also return its derivative (not necessary here, but might be handy if you try to implement a basic neural net with sigmoid activations):

def phi(t, dt = False):
   if dt:
      phi(t) * (1. - phi(t))
   else:
      1. / (1. + np.exp(-t))

2. Cost function

Usually, the logistic cost function is defined as a log cost in the following way (vectorized): $ \frac{1}{m} (-(y^T \log{(\phi(X\theta))})-(1-y^T)(\log{(1 - \phi(X\theta))}) + \frac{\lambda}{2m} \theta^{1T}\theta $ where $\phi(z)$ is the logistic (sigmoid) function, $\theta$ is the full parameter vector (including bias weight), $\theta^1$ is parameter vector with $\theta_1=0$ (by convention, bias is not regularized) and $\lambda$ is the regularization parameter.

What I really don't understand is the part where you multiply y * z. Assuming y is your label vector $y$, why are you multiplying it with your z before applying the sigmoid function? And why do you need to split the cost function into zeros and ones and calculate losses for either sample separately? I think the problem in your code really lies in this part: you must be erroneously multiplying $y$ with $X\theta$ before applying $\phi(.)$.

Also, this bit here: X.dot(w) + c. So c is your bias parameter, right? Why are you adding it to every element of $X\theta$? It shouldn't be added - it should be the first element of the vector $X\theta$. Yes, you don't regularize it, but you need to use in the "prediction" part of the loss function.

In your code, I also see the cost function as being overly complicated. Here's what I would try:

def loss(X,y,w,lam):
   #calculate "prediction"
   Z = np.dot(X,w)
   #calculate cost without regularization
   #shape of X is (m,n), shape of w is (n,1)
   J = (1./len(X)) * (-np.dot(y.T, np.log(phi(Z))) * np.dot((1-y.T),np.log(1 - phi(Z))))
   #add regularization
   #temporary weight vector
   w1 = copy.copy(w) #import copy to create a true copy
   w1[0] = 0
   J += (lam/(2.*len(X))) * np.dot(w1.T, w)
   return J

3. Gradient

Again, let's first go over the formula for the gradient of the logistic loss, again, vectorized: $\frac{1}{m} ((\phi(X\theta) - y)^TX)^T + \frac{\lambda}{m}\theta^1$. This will return a vector of derivatives (i.e. gradient) for all parameters, regularized properly (without the bias term regularized).

Here again, you've multiplied by $y$ way too soon: phi(y * z). In fact, you shouldn't have multiplied by $y$ in gradient at all.

This is what I would do for the gradient:

def gradient(X, y, w, lam):
   #calculate the prediction
   Z = np.dot(X,w)
   #temporary weight vector
   w1 = copy.copy(w) #import copy to create a true copy
   w1[0] = 0
   #calc gradient
   grad = (1./len(X)) * (np.dot((phi(Z) - y).T,X).T) + (lam/len(X)) * w1
   return grad

The actual gradient descent implementation seems ok to me, but because there are errors in the gradient and cost function implementations, it fails to deliver :/

Hope this will help you get on track.

OTHER TIPS

Below is how you can implement gradient descent in Python:

def sigmoid(z):
    s= 1/(1 + np.exp(-z))
    return s

def propagate(w, b, X, Y):

    m = X.shape[1]

    A = sigmoid(np.dot(w.T,X)+b)                                     # compute activation

    cost = -1/m * np.sum(Y * np.log(A) + (1-Y) * (np.log(1-A)))

    dz= (1/m)*(A - Y)
    dw = np.dot(X, dz.T)
    db = np.sum(dz)


    cost = np.squeeze(cost)
    grads = {"dw": dw,
             "db": db}

    return grads, cost


def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = False):

    costs = []

    for i in range(num_iterations):
        m = X.shape[1]
        grads,cost = propagate(w, b, X, Y)
        b = b - learning_rate*grads["db"]
        w = w - learning_rate*grads["dw"]
        if i % 100 == 0:
            costs.append(cost)
        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))

    params = {"w": w,
              "b": b}
    return params, grads, costs



def predict(w, b, X):
    m = X.shape[1]
    Y_prediction = np.zeros((1,m))
    w = w.reshape(X.shape[0], 1)
    A = sigmoid(np.dot(w.T,X)+ b)

    for i in range(A.shape[1]):
        x_exp = np.exp(A)
        x_sum = np.sum(x_exp,axis=1,keepdims=True)
        s = np.divide(x_exp,x_sum)

    Y_prediction = 1. * (A > 0.5)

    return Y_prediction



def model(X_train, Y_train, X_test, Y_test, num_iterations = 2000, learning_rate = 0.5, print_cost = False):
    w, b = initialize_with_zeros(X_train.shape[0])

    print("learning rate:",learning_rate)

    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost = False)

    w = parameters["w"]
    b = parameters["b"]

    Y_prediction_train = predict(w,b,X_train)
    Y_prediction_test = predict(w,b,X_test)
    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test, 
         "Y_prediction_train" : Y_prediction_train, 
         "w" : w, 
         "b" : b,
         "learning_rate" : learning_rate,
         "num_iterations": num_iterations}

    return d

d = model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = 2000, learning_rate = 0.005, print_cost = True)

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange