Question

One of the nice aspects discussed of the procedure that Vowpal Wabbit uses for SGD updates (pdf) is the so-called weight invariance, described in the linked paper as:

"Among these updates we mainly focus on a novel set of updates that satisfies an additional invariance property: for all importance weights of h, the update is equivalent to two updates with importance weight h/2. We call these updates importance invariant."

What does this mean and why is it useful?


Solution

Often different data samples have different weights (e.g. the cost of misclassifying one group of data is higher than for other groups). Most error metrics are of the form $\sum_i e_i$, where $e_i$ is the loss (e.g. squared error) on data point $i$. A weighted metric of the form $\sum_i w_i e_i$ is therefore equivalent to duplicating each data point $w_i$ times (for integer $w_i$).
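To make that equivalence concrete, here is a tiny sketch (the losses and weights are made up for illustration) showing that a weighted sum of per-example losses matches the unweighted sum over a data set in which each point is duplicated $w_i$ times:

```python
import numpy as np

# Hypothetical per-example losses e_i and integer importance weights w_i.
errors  = np.array([0.5, 1.2, 0.3])
weights = np.array([3, 1, 2])

# Weighted metric: sum_i w_i * e_i
weighted_loss = np.sum(weights * errors)

# Same metric computed on a data set where example i appears w_i times.
duplicated_loss = np.sum(np.repeat(errors, weights))

assert np.isclose(weighted_loss, duplicated_loss)
print(weighted_loss, duplicated_loss)  # both 3*0.5 + 1*1.2 + 2*0.3 = 3.3
```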

One simple case is repeated data: rather than keeping all the duplicate data points, you just "weight" the one retained sample by its number of occurrences.

Now, whilst this is easy to do in a batch setting, it is hard in Vowpal Wabbit's online, big-data setting: with a large data set you do not want to actually present each point $n$ times to handle the weighting, because that multiplies your computational load. Similarly, just multiplying the gradient vector by the weight, which is correct in batch gradient descent, causes big problems for stochastic/online gradient descent: you shoot off in one direction (think of large integer weights), then shoot off in another, causing significant instability. SGD essentially relies on all the per-example gradients being of roughly the same order, so that the learning rate can be set appropriately.

So what the authors propose is to make the update for a training sample $x_i$ with weight $n$ equivalent to presenting the training sample $x_i$ $n$ times consecutively. A numerical illustration of why the naive gradient scaling fails follows below.
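The instability is easy to reproduce numerically. The following minimal sketch (not Vowpal Wabbit's actual update rule; the example, learning rate, and weight are invented) contrasts scaling the gradient by the weight $n$ once with performing $n$ consecutive small updates on the same example, for squared loss on a linear model:

```python
import numpy as np

x   = np.array([1.0, 2.0])   # features of a single (hypothetical) example
y   = 1.0                    # its target
w   = np.zeros(2)            # linear model weights
eta = 0.1                    # learning rate
n   = 20                     # importance weight of this example

def grad(w, x, y):
    # gradient of 0.5 * (w.x - y)^2 with respect to w
    return (w @ x - y) * x

# (a) Naive: multiply the gradient by the importance weight in one step.
w_naive = w - eta * n * grad(w, x, y)

# (b) Present the example n times consecutively, recomputing the gradient
#     after every small update so it shrinks as the prediction improves.
w_repeat = w.copy()
for _ in range(n):
    w_repeat = w_repeat - eta * grad(w_repeat, x, y)

print("naive prediction   :", w_naive @ x)   # 10.0  -- overshoots target y = 1
print("n-update prediction:", w_repeat @ x)  # ~1.0  -- converges to the target
```

With these numbers the naive update overshoots the target by an order of magnitude, while the $n$ consecutive updates converge, because the gradient is recomputed, and shrinks, after every step.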

The idea is that presenting the example consecutively avoids the problem, because the error gradient for that single example $x_i$ shrinks with each consecutive presentation and update (as you get closer and closer to the minimum for that specific example). In other words, the consecutive updates provide a kind of feedback control.
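To make the feedback-control intuition concrete, here is a small worked calculation (standard SGD algebra, not taken from the linked paper) for squared loss $\frac{1}{2}(w^\top x - y)^2$ with learning rate $\eta$. Writing $p_t = w_t^\top x$ for the prediction after $t$ consecutive updates on the same example,

$$
p_{t+1} = p_t - \eta\,(p_t - y)\,x^\top x
\quad\Rightarrow\quad
p_{t+1} - y = (1 - \eta\, x^\top x)\,(p_t - y),
$$

so after $n$ consecutive presentations the residual is $(1 - \eta\, x^\top x)^n (p_0 - y)$, which decays geometrically whenever $0 < \eta\, x^\top x < 2$. By contrast, a single update with the gradient multiplied by $n$ leaves a residual of $(1 - n\eta\, x^\top x)(p_0 - y)$, whose magnitude grows with $n$ once $n\eta\, x^\top x > 2$: that is the overshooting described above.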

To me it sounds like you would still have instabilities (you reach zero error on $x_i$, then zero error on $x_{i+1}$, ...), and the learning rate will still need to be adjusted to take the size of the weights into account.

Licensed under: CC-BY-SA with attribution