Question

In neural networks and older classification methods, we usually construct an objective function to achieve dimensionality reduction. But a Deep Belief Network (DBN) built from Restricted Boltzmann Machines (RBM) learns the data structure through unsupervised learning. How does it achieve dimensionality reduction without knowing the ground truth and without constructing an objective function?


Solution

As you know, a deep belief network (DBN) is a stack of restricted Boltzmann machines (RBM), so let's look at the RBM: a restricted Boltzmann machine is a generative model, which means it is able to generate samples from the learned probability distribution at the visible units (the input). While training the RBM, you show it how your input samples are distributed, and the RBM learns to generate such samples itself. It does so by adjusting the visible and hidden biases and the weights in between.
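As a rough illustration, here is a minimal NumPy sketch (the layer sizes and names are illustrative assumptions, not taken from the original answer): the RBM's only parameters are exactly those biases and weights, and generating a sample amounts to alternating Gibbs sampling between the visible and hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden = 784, 100  # e.g. flattened 28x28 images -> 100 hidden features
W = rng.normal(0.0, 0.01, (n_visible, n_hidden))  # weights w_ij
a = np.zeros(n_visible)                           # visible biases a_i
b = np.zeros(n_hidden)                            # hidden biases b_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v):
    """Gibbs half-step: p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij)."""
    p = sigmoid(b + v @ W)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h):
    """Other half-step: p(v_i = 1 | h) = sigmoid(a_i + sum_j h_j w_ij)."""
    p = sigmoid(a + h @ W.T)
    return p, (rng.random(p.shape) < p).astype(float)

# "Generative" means: start from noise and alternate between the layers;
# after training, the visible configurations follow the learned distribution.
v = (rng.random(n_visible) < 0.5).astype(float)
for _ in range(100):
    _, h = sample_h_given_v(v)
    _, v = sample_v_given_h(h)
```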

The choice of the number of hidden units is completely up to you: if you give it fewer hidden units than visible units, the RBM has to recreate the probability distribution at the input using only the hidden units it has, and that compression is exactly the dimensionality reduction. And that is already the objective: $p(\mathbf{v})$, the probability distribution at the visible units, should be as close as possible to the probability distribution of your data, $p(\text{data})$.
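Stated explicitly (this maximum-likelihood formulation is standard, even though it is not spelled out above), training adjusts the biases and weights to maximize the probability the model assigns to the training data, which is equivalent to minimizing the KL divergence between the data distribution and the model distribution:

$$\max_{a,\,b,\,W}\ \mathbb{E}_{\mathbf{v} \sim p(\text{data})}\left[\log p(\mathbf{v})\right] \quad\Longleftrightarrow\quad \min_{a,\,b,\,W}\ \operatorname{KL}\!\left(p(\text{data}) \,\|\, p(\mathbf{v})\right)$$

So there is an objective function after all; it is just unsupervised, defined by the data itself rather than by ground-truth labels. The reduced representation of an input $\mathbf{v}$ is then read off as the vector of hidden activation probabilities, one entry per hidden unit.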

To do that, we assign an energy function (both equations taken from A Practical Guide to Training Restricted Boltzmann Machines by G. Hinton) $$E(\mathbf{v},\mathbf{h}) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}$$ to each configuration of visible units $\mathbf{v}$ and hidden units $\mathbf{h}$. Here, $a_i$ and $b_j$ are the biases, and $w_{ij}$ are the weights. Given this energy function, the probability of a visible vector $\mathbf{v}$ is $$p(\mathbf{v}) = \frac 1Z \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}$$ where $Z$ is the partition function (a normalizing constant). With that, we know that to increase the probability of the RBM generating a training sample $\mathbf{v}^{(k)}$ (the superscript denotes the $k$-th training sample), we need to change $a_i$, $b_j$ and $w_{ij}$ so that the energy $E$ for our given $\mathbf{v}^{(k)}$ and the corresponding $\mathbf{h}$ gets lower.
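To make that last step concrete, here is a hedged sketch of how "lower the energy of the training data" turns into parameter updates: the `energy` function transcribes the formula above, and `cd1_step` implements the standard contrastive-divergence (CD-1) update from Hinton's guide. The function names and the learning rate are my own illustrative choices, not from the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h, a, b, W):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij."""
    return -(v @ a) - (h @ b) - (v @ W @ h)

def cd1_step(v0, a, b, W, lr=0.1):
    """One contrastive-divergence (CD-1) update. It approximately raises
    log p(v0) by lowering the energy of configurations containing the
    training sample v0 relative to the model's own reconstructions."""
    # Positive phase: drive the hidden units from the data.
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct the visibles, then the hiddens.
    pv1 = sigmoid(a + h0 @ W.T)
    ph1 = sigmoid(b + pv1 @ W)
    # Update rule: <v h>_data - <v h>_reconstruction (likewise for biases).
    W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
    a += lr * (v0 - pv1)
    b += lr * (ph0 - ph1)
    return a, b, W

# Illustrative usage on a random binary "training" vector:
n_visible, n_hidden = 784, 100
W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
a, b = np.zeros(n_visible), np.zeros(n_hidden)
v = (rng.random(n_visible) < 0.5).astype(float)
a, b, W = cd1_step(v, a, b, W)
```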

Licensed under: CC-BY-SA with attribution