Question

I read about NCE (a form of candidate sampling) from these two sources:

Tensorflow writeup

Original Paper

Can someone help me with the following:

  1. A simple explanation of how NCE works (I found the above difficult to parse and get an understanding of, so something intuitive that leads to the math presented there would be great)
  2. After point 1 above, a naturally intuitive description of how this is different from Negative Sampling. I can see that there's a slight change in the formula but could not understand the math. I do have an intuitive understanding of negative sampling in the context of word2vec - we randomly choose some samples from the vocabulary V and update only those because |V| is large and this offers a speedup. Please correct if wrong.
  3. When to use which one, and how is that decided? It would be great if you could include examples (possibly easy-to-understand applications)
  4. Is NCE better than Negative Sampling? Better in what manner?

Thank you.


Solution

Taken from this post: https://stats.stackexchange.com/a/245452/154812

The issue

There are some issues with learning word vectors using a "standard" neural network. In that approach, the word vectors are learned while the network learns to predict the next word given a window of words (the input of the network).

Predicting the next word is like predicting a class. That is, such a network is just a "standard" multinomial (multi-class) classifier, and it must have as many output neurons as there are classes. When the classes are actual words, the number of neurons is, well, huge.

A "standard" neural network is usually trained with a cross-entropy cost function which requires the values of the output neurons to represent probabilities - which means that the output "scores" computed by the network for each class have to be normalized, converted into actual probabilities for each class. This normalization step is achieved by means of the softmax function. Softmax is very costly when applied to a huge output layer.

The (a) solution

In order to deal with this issue, that is, the expensive computation of the softmax, Word2Vec uses a technique called noise-contrastive estimation. This technique was introduced by [A] (reformulated by [B]) then used in [C], [D], [E] to learn word embeddings from unlabelled natural language text.

The basic idea is to convert the multinomial classification problem (predicting the next word) into a binary classification problem. That is, instead of using softmax to estimate the true probability distribution of the output word, a binary logistic regression (binary classification) is used.

For each training sample, the enhanced (optimized) classifier is fed a true pair (a center word and another word that appears in its context) and $k$ randomly corrupted pairs (consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from the corrupted ones, the classifier will ultimately learn the word vectors.

This is important: instead of predicting the next word (the "standard" training technique), the optimized classifier simply predicts whether a pair of words is good or bad.

Word2Vec slightly customizes this process and calls it negative sampling. In Word2Vec, the words for the negative samples (used for the corrupted pairs) are drawn from a specially designed distribution (the unigram distribution raised to the power 3/4), which favours less frequent words, so they are drawn more often than their raw frequency would suggest.
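As a rough illustration of this recipe, here is a self-contained NumPy sketch (all sizes, names and the toy frequency counts are hypothetical, not word2vec's actual code) of scoring one true pair against $k$ corrupted pairs drawn from the smoothed unigram distribution:

```python
# Rough sketch of the binary-classification recipe: score one true
# (center, context) pair and k corrupted pairs, and apply a logistic loss.
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10_000, 100, 5
W_in = rng.normal(0.0, 0.01, (V, d))    # "input"  (center-word) embeddings
W_out = rng.normal(0.0, 0.01, (V, d))   # "output" (context-word) embeddings

word_freq = rng.integers(1, 1_000, V).astype(float)  # stand-in unigram counts
noise_dist = word_freq ** 0.75                       # word2vec's smoothed unigram:
noise_dist /= noise_dist.sum()                       # rarer words get a relative boost

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center, context):
    h = W_in[center]
    negatives = rng.choice(V, size=k, p=noise_dist)  # k randomly corrupted partners
    pos_score = W_out[context] @ h                   # score of the true pair
    neg_scores = W_out[negatives] @ h                # scores of the corrupted pairs
    # label 1 for the true pair, label 0 for each corrupted pair
    return -np.log(sigmoid(pos_score)) - np.log(sigmoid(-neg_scores)).sum()

print(negative_sampling_loss(center=42, context=7))
```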

References

[A] (2005) - Contrastive estimation: Training log-linear models on unlabeled data

[B] (2010) - Noise-contrastive estimation: A new estimation principle for unnormalized statistical models

[C] (2008) - A unified architecture for natural language processing: Deep neural networks with multitask learning

[D] (2012) - A fast and simple algorithm for training neural probabilistic language models.

[E] (2013) - Learning word embeddings efficiently with noise-contrastive estimation.

OTHER TIPS

Honestly, there is no intuitive way to understand why the NCE loss works without deeply understanding its math. To understand the math, you should read the original paper.

The reason the NCE loss works is that NCE approximates maximum likelihood estimation (MLE) as the ratio of noise to real data, $k$, increases.
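A bit more concretely (following the NCE papers cited in the answer above, and writing $P_{\theta^0}(w \mid h) = \exp(s_{\theta^0}(w,h))$ for the unnormalized model probability): the binary classifier is estimating the posterior probability that a word $w$ came from the data rather than from the noise,

$$P(D = 1 \mid w, h) = \frac{P_{\theta^0}(w \mid h)}{P_{\theta^0}(w \mid h) + k\,P_n(w)} = \sigma\big(\Delta s_{\theta^0}(w,h)\big),$$

where $\sigma$ is the logistic sigmoid and $\Delta s_{\theta^0}(w,h)$ is the shifted logit given by the equation below. The NCE papers show that the gradient of this binary log loss approaches the maximum-likelihood gradient as $k \to \infty$, which is the sense in which NCE approximates MLE.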

The TensorFlow implementation works in practice. Basically, for each example $(x, y)$, where $y$ is the labelled class from the data, the TensorFlow NCE loss samples $k$ classes from the noise distribution. A special version of the logits is calculated for each of the classes (1 from the data + $k$ from the noise distribution) using the equation

$$\Delta s_{\theta^0}(w,h) = s_{\theta^0}(w,h) - \log kP_n(w)$$

where $P_n(w)$ is the noise distribution. With the logits for each class calculated, TensorFlow uses them to compute a binary classification loss (the log loss of logistic regression) for each class, and adds these losses together as the final NCE loss.
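As an illustration of the recipe just described, here is a simplified NumPy sketch (hypothetical names and toy numbers, not the actual `tf.nn.nce_loss` implementation):

```python
# Simplified sketch of the recipe above: shift every score by log(k * Pn(w)),
# then apply a per-class binary log loss and sum. NOT the real tf.nn.nce_loss.
import numpy as np

k = 5  # number of classes sampled from the noise distribution

def log_sigmoid(x):
    return -np.log1p(np.exp(-x))

def nce_loss(true_score, noise_scores, p_noise_true, p_noise_sampled):
    """true_score: s(y, x) for the labelled class; noise_scores: s(w, x) for
    the k sampled noise classes; p_noise_*: Pn(.) at those classes."""
    # Delta s = s(w, h) - log(k * Pn(w)), as in the equation above
    true_logit = true_score - np.log(k * p_noise_true)
    noise_logits = noise_scores - np.log(k * p_noise_sampled)
    # binary log loss: label 1 for the data class, label 0 for each noise class
    return -log_sigmoid(true_logit) - log_sigmoid(-noise_logits).sum()

# toy numbers, purely illustrative
print(nce_loss(true_score=2.0,
               noise_scores=np.array([0.3, -0.1, 0.5, 0.0, 0.2]),
               p_noise_true=1e-4,
               p_noise_sampled=np.full(k, 1e-4)))
```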

However, the implementation is conceptually wrong because the ratio of noise to real data, $k$, is different from the number of classes, $n$, sampled from the noise distribution. TensorFlow NCE does not provide a variable for the noise-to-data ratio, and implicitly assumes $n = k$, which I think is conceptually incorrect.

The original NCE papers skip many of the derivations in their proofs, which makes NCE really hard to understand. To make the math behind NCE easier to follow, I have written a blog post annotating the math from the NCE papers:

https://leimao.github.io/article/Noise-Contrastive-Estimation/.

A college sophomore or above should be able to understand it.

Basically, this selects a sample from the true distribution, which consists of the true class and some other noisy class labels, and then takes the softmax over it.

This is based on sampling words from the true distribution and the noise distribution.

The basic idea here is to train a logistic regression classifier that can separate the samples obtained from the true distribution from the samples obtained from the noise distribution. Remember that when we talk about samples obtained from the true distribution, we are talking about only one sample, which is the true class obtained from the model distribution.

Here I have explained the NCE loss and how it differs from the negative sampling loss:

Noise Contrastive Estimation: Solution for expensive Softmax.

Licensed under: CC-BY-SA with attribution