Question

I went through Geoff Hinton's Neural Networks course on Coursera and also through an introduction to restricted Boltzmann machines, but I still don't understand the intuition behind RBMs.

Why do we need to compute energy in this machine? And what is the use of the probability in this machine? I also saw this video. In the video, he just wrote the probability and energy equations before the computation steps and didn't appear to use them anywhere.

Adding to the above, I am also not sure what the likelihood function is for.


Solution

RBMs are an interesting beast. To answer your question, and to jog my own memory, I'll derive RBMs and talk through the derivation. You mentioned that you're confused about the likelihood, so my derivation will be from the perspective of trying to maximize the likelihood. So let's begin.

RBMs contain two different sets of neurons, visible and hidden; I'll denote them $v$ and $h$ respectively. Given a specific configuration of $v$ and $h$, we map it to probability space.

$$p(v,h) = \frac{e^{-E(v,h)}}{Z}$$

There are a couple more things to define. The surrogate function we use to map from a specific configuration to probability space is called the energy function $E(v,h)$. The constant $Z$ is a normalization factor (the partition function) that ensures we actually end up with a valid probability distribution: $$Z = \sum_{v \in V}\sum_{h \in H}e^{-E(v,h)}$$ Now let's get to what we're really looking for: the probability of a set of visible neurons, in other words, the probability of our data. $$p(v)=\sum_{h \in H}p(v,h)=\frac{\sum_{h \in H}e^{-E(v,h)}}{\sum_{v \in V}\sum_{h \in H}e^{-E(v,h)}}$$

Although there are a lot of terms in this equation, it simply comes down to writing the correct probability expressions. Hopefully, so far, this has helped you see why we need the energy function to calculate the probability, or, as is done more usually, the unnormalized probability $p(v) \cdot Z$. The unnormalized probability is used because the partition function $Z$ is very expensive to compute.
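To make these formulas concrete, here is a minimal brute-force sketch of my own (not part of the original answer). The names `exact_p_v` and `energy_fn` are illustrative; `energy_fn` stands in for whichever energy function you choose (a concrete one appears further down).

```python
import itertools
import numpy as np

def exact_p_v(energy_fn, v, num_visible, num_hidden):
    """Exact p(v) for a tiny binary RBM by brute-force enumeration.

    energy_fn(v, h) -> scalar E(v, h).  Only feasible for very small
    models, because Z sums over all 2**num_visible * 2**num_hidden
    joint configurations -- which is exactly why Z is avoided in practice.
    """
    hidden = [np.array(h) for h in itertools.product([0, 1], repeat=num_hidden)]
    visible = [np.array(s) for s in itertools.product([0, 1], repeat=num_visible)]

    # Unnormalized probability: p(v) * Z = sum_h exp(-E(v, h))
    unnormalized = sum(np.exp(-energy_fn(v, h)) for h in hidden)

    # Partition function: Z = sum_v sum_h exp(-E(v, h))
    Z = sum(np.exp(-energy_fn(s, h)) for s in visible for h in hidden)
    return unnormalized / Z
```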

Now let's get to the actual learning phase of RBMs. To maximize the likelihood, for every data point we take a gradient step that pushes $p(v)$ towards $1$. Getting the gradient expressions takes some mathematical acrobatics. The first thing we do is take the log of $p(v)$; we will operate in log-probability space from now on to make the math feasible.

$$\log(p(v))=\log[\sum_{h \in H}e^{-E(v,h)}]-\log[\sum_{v \in V}\sum_{h \in H}e^{-E(v,h)}]$$ Let's take the gradient with respect to the parameters in $p(v)$:

\begin{align} \frac{\partial \log(p(v))}{\partial \theta}=& -\frac{1}{\sum_{h' \in H}e^{-E(v,h')}}\sum_{h' \in H}e^{-E(v,h')}\frac{\partial E(v,h')}{\partial \theta}\\ & + \frac{1}{\sum_{v' \in V}\sum_{h' \in H}e^{-E(v',h')}}\sum_{v' \in V}\sum_{h' \in H}e^{-E(v',h')}\frac{\partial E(v',h')}{\partial \theta} \end{align}

I did this derivation on paper and only wrote down the semi-final equation so as not to waste a lot of space on this site; I recommend you derive these equations yourself. Here are some identities that help in continuing the derivation. Note that $Zp(v,h)=e^{-E(v,h)}$, that $p(v)=\sum_{h \in H}p(v,h)$, and that $p(h|v) = \frac{p(v,h)}{p(v)}$.

\begin{align} \frac{\partial \log(p(v))}{\partial \theta}&= -\frac{1}{p(v)}\sum_{h' \in H}p(v,h')\frac{\partial E(v,h')}{\partial \theta}+\sum_{v' \in V}\sum_{h' \in H}p(v',h')\frac{\partial E(v',h')}{\partial \theta}\\ \frac{\partial \log(p(v))}{\partial \theta}&= -\sum_{h' \in H}p(h'|v)\frac{\partial E(v,h')}{\partial \theta}+\sum_{v' \in V}\sum_{h' \in H}p(v',h')\frac{\partial E(v',h')}{\partial \theta} \end{align}

And there we go: we have derived the maximum-likelihood gradient for RBMs. If you want, you can write the two terms as expectations over their respective distributions (the conditional and the joint probability).
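To make this tangible, here is a brute-force sketch of my own (not from the original answer) that computes both expectation terms exactly for a toy model, assuming the bilinear energy $E(v,h) = -a^Tv - b^Th - v^TWh$ quoted in the notes below, for which $\partial E/\partial W_{ij} = -v_i h_j$:

```python
import itertools
import numpy as np

def energy(v, h, a, b, W):
    # E(v, h) = -a^T v - b^T h - v^T W h  (the bilinear energy quoted below)
    return -(a @ v) - (b @ h) - (v @ W @ h)

def exact_log_likelihood_grad_W(v, a, b, W):
    """d log p(v) / dW by brute-force enumeration (tiny models only).

    With dE/dW_ij = -v_i h_j, the gradient becomes
    E_{p(h|v)}[v h^T] - E_{p(v',h')}[v' h'^T].
    """
    nv, nh = W.shape
    hidden = [np.array(h) for h in itertools.product([0, 1], repeat=nh)]
    visible = [np.array(s) for s in itertools.product([0, 1], repeat=nv)]

    # Unnormalized joint weights exp(-E(v, h)) over all configurations
    joint = {(tuple(s), tuple(h)): np.exp(-energy(s, h, a, b, W))
             for s in visible for h in hidden}
    Z = sum(joint.values())

    # Positive ("data") term: expectation under p(h | v), visible units clamped to the data
    pos_norm = sum(joint[(tuple(v), tuple(h))] for h in hidden)
    positive = sum(joint[(tuple(v), tuple(h))] / pos_norm * np.outer(v, h)
                   for h in hidden)

    # Negative ("model") term: expectation under the joint p(v', h')
    negative = sum(w / Z * np.outer(np.array(s), np.array(h))
                   for (s, h), w in joint.items())
    return positive - negative
```

On anything larger than a toy model the second (joint) term is intractable, which is where the Monte Carlo machinery described below comes in.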

Notes on the energy function and stochasticity of neurons

As you can see above in my derivation, I left the definition of the energy function rather vague. The reason for doing that is that many different versions of RBMs implement various energy functions. The one that Hinton describes in the lecture linked above, and shown by @Laurens-Meeus, is: $$E(v,h) = -a^Tv - b^Th - v^TWh.$$
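Translated directly into code (a small sketch of mine; `a` and `b` are the visible and hidden bias vectors and `W` the weight matrix):

```python
import numpy as np

def rbm_energy(v, h, a, b, W):
    """E(v, h) = -a^T v - b^T h - v^T W h

    v: visible vector, shape [num_visible]; h: hidden vector, shape [num_hidden]
    a: visible bias; b: hidden bias; W: weights, shape [num_visible, num_hidden]
    """
    return -(a @ v) - (b @ h) - (v @ W @ h)
```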

It might be easier to reason about the gradient terms above via the expectation form.

$$\frac{\partial \log(p(v))}{\partial \theta}= -\mathop{\mathbb{E}}_{p(h'|v)}\frac{\partial E(v,h')}{\partial \theta}+\mathop{\mathbb{E}}_{p(v',h')}\frac{\partial E(v',h')}{\partial \theta}$$

The expectation in the first term is actually really easy to calculate, and that was the genius behind RBMs. By restricting the connections (there are no hidden-hidden or visible-visible links), the conditional expectation simply becomes a forward propagation of the RBM with the visible units clamped. This is the so-called wake phase in Boltzmann machines. Calculating the second term is much harder, and usually Monte Carlo methods are used to do so. Writing the gradient as an average over Monte Carlo runs:

$$\frac{\partial \log(p(v))}{\partial \theta}\approx -\langle \frac{\partial E(v,h')}{\partial \theta}\rangle_{p(h'|v)}+\langle\frac{\partial E(v',h')}{\partial \theta}\rangle_{p(v',h')}$$
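Concretely, for the bilinear energy above with binary units, the first (wake) term only needs $p(h_j = 1 \mid v)$, which factorizes into independent sigmoids because there are no hidden-hidden connections. A small sketch of that clamped forward pass (my own code, not from the original answer):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_given_visible(v, b, W):
    """p(h_j = 1 | v) for every hidden unit, with the visible units clamped.

    With no hidden-hidden connections, the conditional factorizes:
    p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W_ij).
    """
    return sigmoid(b + v @ W)
```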

Calculating the first term is not hard, as stated above, so Monte Carlo is only needed for the second term. Monte Carlo methods use repeated random sampling from the distribution to approximate the expectation (a sum or integral). In classical RBMs, this random sampling means setting a unit to either 0 or 1 stochastically based on its probability: draw a uniform random number; if it is less than the neuron's probability, set the unit to 1, otherwise set it to 0.
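Here is a minimal sketch of that sampling rule together with one block Gibbs step (visible to hidden and back), which is what contrastive-divergence-style training uses to approximate the second term. The function names and the toy shapes are my own choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p, rng):
    # Set each unit to 1 if a uniform random number falls below its probability.
    return (rng.random(p.shape) < p).astype(float)

def gibbs_step(v, a, b, W, rng):
    """One block Gibbs step: sample h ~ p(h|v), then a new v ~ p(v|h)."""
    p_h = sigmoid(b + v @ W)      # p(h_j = 1 | v)
    h = sample_bernoulli(p_h, rng)
    p_v = sigmoid(a + W @ h)      # p(v_i = 1 | h)
    v_new = sample_bernoulli(p_v, rng)
    return v_new, h

# Example usage on a random 6-visible / 3-hidden RBM:
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 3))
a, b = np.zeros(6), np.zeros(3)
v = rng.integers(0, 2, size=6).astype(float)
v_sample, h_sample = gibbs_step(v, a, b, W, rng)
```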

OTHER TIPS

In addition to the existing answers, I would like to talk a bit about this energy function and the intuition behind it. Sorry if this is a bit long and physical.

The energy function describes a so-called Ising model, which is a model of ferromagnetism in terms of statistical mechanics / quantum mechanics. In statistical mechanics, we use a so-called Hamiltonian operator to describe the energy of a quantum-mechanical system. And a system always tries to be in the state with the lowest energy.

Now, the Ising model basically describes the interaction between electrons with a spin $\sigma_k$ of either +1 or -1, in the presence of an external magnetic field $h$. The interaction between two electrons $i$ and $j$ is described by a coefficient $J_{ij}$. The Hamiltonian (or energy function) is $$\hat{H} = \sum_{i,j} J_{ij} \sigma_i \sigma_j - \mu \sum_j h_j \sigma_j$$ where $\hat{H}$ denotes the Hamiltonian.

A standard procedure to get from an energy function to the probability that a system is in a given state (here: a configuration of spins, e.g. $\sigma_1 = {+1}, \sigma_2 = {-1}, \ldots$) is to use the Boltzmann distribution, which says that at temperature $T$, the probability $p_i$ of the system being in a state $i$ with energy $E_i$ is given by $$p_i = \frac{\exp(-E_i/kT)}{\sum_{i}\exp(-E_i/kT)}$$ At this point, you should recognize that these two equations are exactly the same as the ones in the videos by Hinton and in the answer by Armen Aghajanyan.
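To see those two formulas in action, here is a tiny brute-force sketch (my own illustration, with arbitrary example couplings $J_{ij}$ and fields $h_j$, using the Hamiltonian exactly as written above) that enumerates all spin configurations of a three-spin system and computes their Boltzmann probabilities:

```python
import itertools
import numpy as np

def ising_energy(spins, J, h, mu=1.0):
    """H = sum_{i,j} J_ij s_i s_j - mu * sum_j h_j s_j, as written above."""
    s = np.array(spins, dtype=float)
    return float(s @ J @ s - mu * (h @ s))

def boltzmann_distribution(J, h, kT=1.0):
    """Boltzmann probability of every spin configuration of a tiny Ising system."""
    n = len(h)
    configs = list(itertools.product([-1, +1], repeat=n))
    energies = np.array([ising_energy(c, J, h) for c in configs])
    weights = np.exp(-energies / kT)
    return configs, weights / weights.sum()

# Example: 3 spins, nearest-neighbour coupling, small external field
J = np.array([[0.0, -1.0,  0.0],
              [0.0,  0.0, -1.0],
              [0.0,  0.0,  0.0]])
h = np.array([0.1, 0.0, -0.1])
configs, probs = boltzmann_distribution(J, h)
```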

This leads us to the question: what does the RBM have to do with this quantum-mechanical model of ferromagnetism?

We need to use a final physical quantity: the entropy. As we know from thermodynamics, a system will settle in the state with the minimal energy, which also corresponds to the state with the maximal entropy.

As introduced by Shannon in 1948, in information theory the entropy $H$ can also be seen as a measure of the information content in a random variable $X$, given by the following sum over all possible states of $X$: $$H(X) = -\sum_i P(x_i) \log P(x_i)$$ Now, the most efficient way to encode the information content of $X$ is one that maximizes the entropy $H$.
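As a one-line sketch of that sum (my addition; the base of the logarithm only determines the unit, bits for $\log_2$ and nats for the natural log):

```python
import numpy as np

def shannon_entropy(p):
    """H(X) = -sum_i p(x_i) * log p(x_i), ignoring zero-probability states."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))  # in bits; use np.log for nats

# A fair coin carries exactly one bit of information:
print(shannon_entropy([0.5, 0.5]))  # -> 1.0
```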

Finally, this is where we get back to RBMs: basically, we want the RBM to encode as much information as possible, so we have to maximize the (information-theoretical) entropy in our RBM system. As proposed by Hopfield in 1982, we can maximize the information-theoretical entropy exactly like the physical entropy: by modelling the RBM like the Ising model above and using the same methods to minimize the energy. And that is why we need this strange energy function in an RBM!

The nice mathematical derivation in Armen Aghajanyan's answer shows everything we need to do to minimize the energy, thus maximizing the entropy and storing as much information as possible in our RBM.

PS: Please, dear physicists, forgive any inaccuracies in this engineer's derivation. Feel free to comment on or fix inaccuracies (or even mistakes).

@Armen's answer has given me a lot of insight. One question hasn't been answered, however.

The goal is to maximize the probability (or likelihood) of $v$. This corresponds to minimizing the energy function related to $v$ and $h$:

$$E(v,h) = -a^{\mathrm{T}} v - b^{\mathrm{T}} h -v^{\mathrm{T}} W h$$

Our variables are $a$, $b$ and $W$, which have to be trained. I'm quite sure this training will be the ultimate goal of the RBM.
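Putting the pieces from the accepted answer together, training typically loops a small update like the following over the data. This is a minimal CD-1 sketch of my own (the learning rate, the binary-data assumption, and the single Gibbs step are my choices, not something prescribed in the answers above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, a, b, W, lr=0.1, rng=None):
    """One contrastive-divergence (CD-1) update of a, b and W for one data vector v0."""
    rng = np.random.default_rng() if rng is None else rng

    # Positive ("wake") phase: hidden probabilities with the visible units clamped to the data
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # Negative phase: one Gibbs step back to the visible units ("reconstruction")
    pv1 = sigmoid(a + W @ h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)

    # Approximate gradient: <v h>_data - <v h>_model, and likewise for the biases
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)
    return a, b, W
```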

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange