Gradient Descent Step for word2vec negative sampling
16-10-2019
Question
For word2vec with negative sampling, the cost function for a single word is the following, according to word2vec: $$ E = -\log \sigma(v_{w_O}' \cdot u_{w_I}) - \sum_{k=1}^{K} \log \sigma(-v_{w_k}' \cdot u_{w_I}) $$
$v_{w_O}'$ = hidden->output word vector of the output word
$u_{w_I}$ = input->hidden word vector of the input word
$v_{w_k}'$ = hidden->output word vector of the k-th negative sampled word
$\sigma$ is the sigmoid function
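For concreteness, this cost can be sketched in NumPy (the function and variable names here are mine, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_loss(v_out, v_negs, u_in):
    """Negative-sampling loss E for one (input, output) word pair.

    v_out  : output vector v'_{w_O} of the output word
    v_negs : list of output vectors v'_{w_k} for the K negative samples
    u_in   : input vector u_{w_I} of the input word
    """
    pos = -np.log(sigmoid(np.dot(v_out, u_in)))
    neg = -sum(np.log(sigmoid(-np.dot(v_k, u_in))) for v_k in v_negs)
    return pos + neg

# Illustrative random vectors; dimensions are arbitrary.
rng = np.random.default_rng(0)
d = 8
u = rng.normal(size=d)
v_o = rng.normal(size=d)
v_ks = [rng.normal(size=d) for _ in range(5)]
E = ns_loss(v_o, v_ks, u)  # each -log(sigmoid(.)) term is positive
```

Note that E shrinks as $v_{w_O}'$ aligns with $u_{w_I}$ and the negative vectors point away from it.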
And taking the derivative with respect to the dot product $v_{w_j}' \cdot u_{w_I}$ gives:
$ \frac{\partial E}{\partial (v_{w_j}' \cdot u_{w_I})} = \sigma(v_{w_j}' \cdot u_{w_I}) - 1 $ if $ w_j = w_O $
$ \frac{\partial E}{\partial (v_{w_j}' \cdot u_{w_I})} = \sigma(v_{w_j}' \cdot u_{w_I}) $ if $ w_j = w_k \ \text{for} \ k = 1 \dots K $
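One way to sanity-check derivatives like these is a finite-difference comparison, treating the dot product as a scalar $x$ (a quick numerical sketch, not part of the original derivation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def numgrad(f, x, eps=1e-6):
    """Central finite-difference approximation of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.37  # an arbitrary value of the dot product v'_{w_j} . u_{w_I}

# Output-word term: E_pos(x) = -log(sigmoid(x));
# the analytic gradient should be sigmoid(x) - 1.
g_pos = numgrad(lambda t: -np.log(sigmoid(t)), x)
assert np.isclose(g_pos, sigmoid(x) - 1)

# Negative-sample term: E_neg(x) = -log(sigmoid(-x));
# the analytic gradient should be sigmoid(x).
g_neg = numgrad(lambda t: -np.log(sigmoid(-t)), x)
assert np.isclose(g_neg, sigmoid(x))
```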
Then we can use the chain rule to get
$ \frac{\partial E}{\partial v_{w_j}'} = \frac{\partial E}{\partial (v_{w_j}' \cdot u_{w_I})} \cdot \frac{\partial (v_{w_j}' \cdot u_{w_I})}{\partial v_{w_j}'} = \frac{\partial E}{\partial (v_{w_j}' \cdot u_{w_I})} \, u_{w_I} $
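Putting the pieces together, the resulting gradient-descent update on the output vectors could be sketched as below (learning rate, shapes, and names are illustrative; a real implementation would also update $u_{w_I}$):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(v_out, v_negs, u_in, lr=0.025):
    """One gradient-descent step on the output vectors.

    Uses dE/dv'_{w_j} = (sigmoid(v'_{w_j} . u_{w_I}) - t_j) * u_{w_I},
    with t_j = 1 for the output word and t_j = 0 for negative samples.
    """
    v_out = v_out - lr * (sigmoid(v_out @ u_in) - 1.0) * u_in
    v_negs = [v - lr * sigmoid(v @ u_in) * u_in for v in v_negs]
    return v_out, v_negs

# One step pushes v'_{w_O} toward u_{w_I} and the negatives away from it.
rng = np.random.default_rng(1)
u = rng.normal(size=8)
v_o = rng.normal(size=8)
v_ks = [rng.normal(size=8) for _ in range(3)]
v_o_new, v_ks_new = sgd_step(v_o, v_ks, u)
```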
Is my reasoning and derivative correct? I am still new to ML so any help would be great!
Answer
Looks good to me. This derivative is also presented in the paper (equations 56-58).
The paper you're linking to is, to the best of my knowledge, the most thorough attempt to explain how word2vec works, but there are also many other resources on the topic (just search for word2vec on arxiv.org). If you're interested in word2vec, you may find GloVe interesting too (see: Linking GloVe with word2vec).