Gradient Descent Step for word2vec negative sampling
16-10-2019
Question
For word2vec with negative sampling, the cost function for a single word is the following, according to word2vec: $$ E = -\log \sigma(v_{w_O}' \cdot u_{w_I}) - \sum_{k=1}^{K} \log \sigma(-v_{w_k}' \cdot u_{w_I}) $$
$v_{w_O}'$ = hidden->output word vector of the output word
$u_{w_I}$ = input->hidden word vector of the input word
$v_{w_k}'$ = hidden->output word vector of the k-th negative sampled word
$\sigma$ is the sigmoid function
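For concreteness, this cost can be sketched in NumPy (the function and variable names here are mine, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_loss(v_out, v_negs, u_in):
    """Negative-sampling loss E for one (input, output) word pair.

    v_out  : output vector v'_{w_O} of the output word
    v_negs : list of output vectors v'_{w_k} for the K negative samples
    u_in   : input vector u_{w_I} of the input word
    """
    pos = -np.log(sigmoid(np.dot(v_out, u_in)))
    neg = -sum(np.log(sigmoid(-np.dot(v_k, u_in))) for v_k in v_negs)
    return pos + neg

# Illustrative random vectors; dimensions are arbitrary.
rng = np.random.default_rng(0)
d = 8
u = rng.normal(size=d)
v_o = rng.normal(size=d)
v_ks = [rng.normal(size=d) for _ in range(5)]
E = ns_loss(v_o, v_ks, u)  # each -log(sigmoid(.)) term is positive
```

Note that E shrinks as $v_{w_O}'$ aligns with $u_{w_I}$ and the negative vectors point away from it.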
And taking the derivative with respect to the dot product $v_{w_j}' \cdot u_{w_I}$ gives:
$ \frac{\partial E}{\partial (v_{w_j}' \cdot u_{w_I})} = \sigma(v_{w_j}' \cdot u_{w_I}) - 1 $ if $ w_j = w_O $
$ \frac{\partial E}{\partial (v_{w_j}' \cdot u_{w_I})} = \sigma(v_{w_j}' \cdot u_{w_I}) $ if $ w_j = w_k \ \text{for} \ k = 1 \dots K $
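One way to sanity-check derivatives like these is a finite-difference comparison, treating the dot product as a scalar $x$ (a quick numerical sketch, not part of the original derivation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def numgrad(f, x, eps=1e-6):
    """Central finite-difference approximation of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.37  # an arbitrary value of the dot product v'_{w_j} . u_{w_I}

# Output-word term: E_pos(x) = -log(sigmoid(x));
# the analytic gradient should be sigmoid(x) - 1.
g_pos = numgrad(lambda t: -np.log(sigmoid(t)), x)
assert np.isclose(g_pos, sigmoid(x) - 1)

# Negative-sample term: E_neg(x) = -log(sigmoid(-x));
# the analytic gradient should be sigmoid(x).
g_neg = numgrad(lambda t: -np.log(sigmoid(-t)), x)
assert np.isclose(g_neg, sigmoid(x))
```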
Then we can use the chain rule to get
$ \frac{\partial E}{\partial v_{w_j}'} = \frac{\partial E}{\partial (v_{w_j}' \cdot u_{w_I})} \cdot \frac{\partial (v_{w_j}' \cdot u_{w_I})}{\partial v_{w_j}'} = \frac{\partial E}{\partial (v_{w_j}' \cdot u_{w_I})} \, u_{w_I} $
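Putting the pieces together, the resulting gradient-descent update on the output vectors could be sketched as below (learning rate, shapes, and names are illustrative; a real implementation would also update $u_{w_I}$):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(v_out, v_negs, u_in, lr=0.025):
    """One gradient-descent step on the output vectors.

    Uses dE/dv'_{w_j} = (sigmoid(v'_{w_j} . u_{w_I}) - t_j) * u_{w_I},
    with t_j = 1 for the output word and t_j = 0 for negative samples.
    """
    v_out = v_out - lr * (sigmoid(v_out @ u_in) - 1.0) * u_in
    v_negs = [v - lr * sigmoid(v @ u_in) * u_in for v in v_negs]
    return v_out, v_negs

# One step pushes v'_{w_O} toward u_{w_I} and the negatives away from it.
rng = np.random.default_rng(1)
u = rng.normal(size=8)
v_o = rng.normal(size=8)
v_ks = [rng.normal(size=8) for _ in range(3)]
v_o_new, v_ks_new = sgd_step(v_o, v_ks, u)
```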
Is my reasoning and derivative correct? I am still new to ML so any help would be great!
Answer
Looks good to me. This derivative is also presented in the paper (equations 56-58).
The paper you're linking to is, to the best of my knowledge, the most thorough attempt to explain how word2vec works, but there are also many other resources on the topic (just search for word2vec on arxiv.org). If you're interested in word2vec, you may find GloVe interesting too (see: Linking GloVe with word2vec).