Problem

Let's assume that I want to train a stochastic gradient descent regression algorithm using a dataset that has N samples. Since the size of the dataset is fixed, I will reuse the data T times. At each iteration, or "epoch", I use each training sample exactly once after randomly reordering the whole training set.

My implementation is based on Python and NumPy, so using vector operations can remarkably decrease computation time. Coming up with a vectorized implementation of batch gradient descent is quite straightforward. However, in the case of stochastic gradient descent I cannot figure out how to avoid the outer loop that iterates through all the samples at each epoch.
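
For reference, a minimal sketch of what I mean by the vectorized batch update is below (the names X, Y, theta and the fixed learning rate lr are illustrative, not my exact code):

import numpy as np

def batch_gd_step(X, Y, theta, lr=0.01):
    # X is (n, p), Y has length n, theta has length p; lr is an illustrative learning rate
    residuals = X.dot(theta) - Y             # errors for all n samples at once
    gradient = X.T.dot(residuals) / len(Y)   # gradient of the mean squared error (up to a constant factor)
    return theta - lr * gradient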

Does anybody know any vectorized implementation of stochastic gradient descent?

EDIT: I've been asked why I would want to use online gradient descent if the size of my dataset is fixed.

From [1], one can see that online gradient descent converges more slowly than batch gradient descent to the minimum of the empirical cost. However, it converges faster to the minimum of the expected cost, which measures generalization performance. I'd like to test the impact of these theoretical results on my particular problem by means of cross-validation. Without a vectorized implementation, my online gradient descent code is much slower than my batch gradient descent code, which considerably increases the time it takes for the cross-validation process to complete.

EDIT: I include here the pseudocode of my on-line gradient descent implementation, as requested by ffriend. I am solving a regression problem.

Method: on-line gradient descent (regression)
Input: X (n x p matrix; each row contains a training sample, represented as a length-p vector), Y (length-n vector; outputs of the training samples)
Output: A (length-p+1 vector of coefficients)

Initialize coefficients (assign value 0 to all coefficients)
Calculate outputs F
prev_error = inf
error = sum((F-Y)^2)/n
it = 0
while abs(error - prev_error)>ERROR_THRESHOLD and it<=MAX_ITERATIONS:
    Randomly shuffle training samples
    for each training sample i:
        Compute error for training sample i
        Update coefficients based on the error above
    prev_error = error
    Calculate outputs F
    error = sum((F-Y)^2)/n
    it = it + 1
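
For concreteness, a rough NumPy sketch of one epoch of this loop (assuming a squared-error update and a fixed learning rate lr, which are simplifications of my actual code) looks like:

import numpy as np

def online_epoch(X, Y, A, lr=0.01):
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])   # prepend a column of 1s so A can hold the p+1 coefficients
    for i in np.random.permutation(n):     # shuffle, then use each sample exactly once
        err = Xb[i].dot(A) - Y[i]          # error for training sample i
        A = A - lr * err * Xb[i]           # update coefficients based on that error
    return A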

[1] "Large Scale Online Learning", L. Bottou, Y. Le Cunn, NIPS 2003.


Solution

First of all, the word "sample" is normally used to describe a subset of a population, so I will refer to the same thing as an "example".

Your SGD implementation is slow because of this line:

for each training example i:

Here you explicitly use exactly one example for each update of model parameters. By definition, vectorization is a technique for converting operations on one element into operations on a vector of such elements. Thus, no, you cannot process examples one by one and still use vectorization.

You can, however, approximate true SGD by using mini-batches. A mini-batch is a small subset of the original dataset (say, 100 examples). You calculate the error and the parameter updates from one mini-batch at a time, so you still iterate over many small subsets rather than optimizing over the whole dataset at once, which keeps the process stochastic. So, to make your implementation much faster, it's enough to change the previous line to:

batches = split dataset into mini-batches
for batch in batches: 

and calculate the error from the batch, not from a single example.
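
For example, a minimal sketch of a mini-batch epoch for the regression case could look like this (the batch size, learning rate and function name are my own illustrative choices):

import numpy as np

def minibatch_sgd_epoch(X, Y, theta, lr=0.01, batch_size=100):
    n = len(Y)
    idx = np.random.permutation(n)                       # shuffle once per epoch
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        residuals = X[batch].dot(theta) - Y[batch]       # vectorized over the whole mini-batch
        gradient = X[batch].T.dot(residuals) / len(batch)
        theta = theta - lr * gradient                    # one parameter update per mini-batch
    return theta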

Though it is pretty obvious, I should also mention vectorization at the per-example level. That is, instead of something like this:

theta = np.array([...])  # parameter vector
x = np.array([...])      # example
y = 0                    # predicted response
for i in range(len(x)):
    y += x[i] * theta[i]
error = (true_y - y) ** 2  # true_y - true value of response

you should definitely do something like this:

error = (true_y - np.dot(x, theta)) ** 2

which, again, is easy to generalize to mini-batches:

true_y = np.array([...])     # vector of response values
X = np.array([[...], [...]]) # mini-batch, one example per row
errors = true_y - np.dot(X, theta)
error = np.sum(errors ** 2)

Other tips

Check out the partial_fit method of scikit's SGD classifier. You have control over what you call with it: you can do "true" online learning by passing one instance at a time, or you can batch instances into mini-batches if all your data are available in an array. If they are, you can slice the array to provide the mini-batches.
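
For instance, a minimal sketch of mini-batch training through partial_fit might look like this (I use SGDRegressor since the question is about regression, and the batch size of 100 is an illustrative value):

import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()

def train_epoch(model, X, Y, batch_size=100):
    idx = np.random.permutation(len(Y))         # reshuffle every epoch
    for start in range(0, len(Y), batch_size):
        batch = idx[start:start + batch_size]
        model.partial_fit(X[batch], Y[batch])   # one pass of SGD updates over this mini-batch
    return model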
