Gradient descent vs fminunc
03-07-2021
Question
I am trying to run gradient descent but cannot get the same result as Octave's built-in fminunc, even when using exactly the same data.
My code is:
% for 5000 iterations
for iter = 1:5000
  %% Calculate the cost and the new gradient
  [cost, grad] = costFunction(initial_theta, X, y);
  %% Gradient = Old Gradient - (Learning Rate * New Gradient)
  initial_theta = initial_theta - (alpha * grad);
end
where costFunction calculates the cost and gradient for the training data (X, y) and parameters theta.
Octave's built-in fminunc, calling the same costFunction with the same data, finds a much better answer in far fewer iterations.
Given that it uses the same cost function, I assume costFunction is correct.
I have tried decreasing the learning rate (in case I am hitting a local minimum) and increasing the number of iterations. The cost stops decreasing, so it seems to have found a minimum, but the final theta still has a much larger cost and is nowhere near as accurate.
Even if fminunc is using a better algorithm, shouldn't gradient descent eventually find the same answer with enough iterations and a smaller learning rate?
Or can anyone see if I am doing anything wrong?
Thank you for any and all help.
Solution
The comments in your code are wrong (the update computes a new theta, not a new gradient: theta = theta - alpha * gradient), but the algorithm itself is fine.
Gradient descent easily runs into numerical problems, so I suggest performing feature normalization first.
Also, if you're unsure about your learning rate, try adjusting it dynamically. Something like:
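For illustration, here is a minimal z-score normalization sketch in Python/NumPy (the Octave equivalent uses mean and std the same way; the feature matrix X below is made-up data standing in for yours):

```python
import numpy as np

# Hypothetical feature matrix; replace with your own X
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])

# Z-score normalization: each column gets zero mean and unit variance,
# so one large-scale feature cannot dominate the gradient steps
mu     = X.mean(axis=0)
sigma  = X.std(axis=0)
X_norm = (X - mu) / sigma
```

Remember to save mu and sigma: any new example must be normalized with the same statistics before prediction.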
best_theta = initial_theta;
alpha      = 1;
[best_cost, grad] = costFunction(best_theta, X_reg, y);
for iter = 1:500
  % Try a step from the best point found so far
  trial_theta = best_theta - alpha * grad;
  [cost, trial_grad] = costFunction(trial_theta, X_reg, y);
  if (cost < best_cost)
    % The step improved the cost: accept it
    best_theta = trial_theta;
    best_cost  = cost;
    grad       = trial_grad;
  else
    % The step overshot: shrink the learning rate and retry
    alpha = alpha * 0.99;
  end
end
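The same adaptive-step idea can be sketched in Python on a toy one-dimensional cost, f(theta) = (theta - 3)^2 (an assumption for illustration; it stands in for the real costFunction, which the question does not show):

```python
def cost_function(theta):
    """Toy 1-D cost: f(theta) = (theta - 3)^2, gradient 2*(theta - 3)."""
    return (theta - 3.0) ** 2, 2.0 * (theta - 3.0)

best_theta = 0.0
best_cost, grad = cost_function(best_theta)
alpha = 1.0

for _ in range(500):
    # Try a step from the best point found so far
    trial = best_theta - alpha * grad
    cost, trial_grad = cost_function(trial)
    if cost < best_cost:
        # The step improved the cost: accept it
        best_theta, best_cost, grad = trial, cost, trial_grad
    else:
        # The step overshot: shrink the learning rate and retry
        alpha *= 0.99
```

Starting from alpha = 1 this overshoots at first, shrinks alpha, and then converges close to the minimum at theta = 3.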
Moreover, remember that different thetas can give the same decision boundary. For example, for the hypothesis h(x) = theta(0) + theta(1) * x(1) + theta(2) * x(2), these thetas give the same boundary:
theta = [5, 10, 10];
theta = [10, 20, 20];