The second way you describe is the correct way to perform Gradient Descent. The true gradient depends on the whole data set, so one iteration of gradient descent requires a pass over all of the data. (This is true for any learning algorithm where you can take the gradient of the objective.)
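For concreteness, here is a minimal sketch of full-batch gradient descent. The least-squares loss, learning rate, and function name are my own illustrative assumptions, not something from your question:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Illustrative full-batch gradient descent for least squares."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        # Exact gradient of the mean squared error: uses all m points per update.
        grad = X.T @ (X @ w - y) / m
        w -= lr * grad
    return w
```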
The "first way" is close to something that is called Stochastic Gradient Descent. The idea here is that using the whole data set for one update might be overkill, especially if some of the data points are redundant. In this case, we pick a random point from the data set - essentially setting m=1. We then update based on successive selections of single points in the data set. This way we can do m updates at about the same cost as one update of Gradient Descent. But each update is a bit noisy, which can make convergence to the final solution difficult.
The compromise between these approaches is called Mini-Batch Gradient Descent. Taking the gradient of the whole data set is one full round of "batch" processing, as we need the whole data set on hand. Instead we do a mini-batch, selecting only a small subset of the whole data set. We set k, 1 < k < m, where k is the number of points in the mini-batch. At every iteration we select k random data points, compute the gradient from them, and perform the update; repeat until convergence. Increasing k gives a more accurate gradient estimate per update but makes each update more expensive, so choosing k is a tradeoff between speed and accuracy.
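Continuing the same illustrative setup, a mini-batch sketch with batch size k:

```python
import numpy as np

def minibatch_gd(X, y, k=32, lr=0.01, n_epochs=10):
    """Illustrative mini-batch GD: average the gradient over k random points."""
    m, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        idx = rng.permutation(m)  # reshuffle, then slice into batches of size k
        for start in range(0, m, k):
            batch = idx[start:start + k]
            # Average the gradient over the k points in this mini-batch.
            grad = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
            w -= lr * grad
    return w
```

Setting k = m recovers the batch version, and k = 1 recovers SGD.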
Note: For both stochastic & mini-batch gradient descent, it is important to shuffle the data / select the next data point(s) at random. If you use the same iteration order over the data points every time, you can get really weird / bad results, often diverging away from the solution.
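In the sketches above this is handled by the rng.permutation(m) call at the top of each epoch, which redraws the visit order before every pass over the data.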