Question

Motivation:

The state-of-the-art algorithm for object recognition is a deep convolutional neural network trained through backpropagation, where the main problem is getting the network to settle into a good local minimum: http://books.nips.cc/papers/files/nips25/NIPS2012_0534.pdf

It is possible to record spike counts from neurons in the brain that support object recognition, and it is reasonable to claim that a neural network that approximates the responses of these neurons is in a good local minimum. http://www.sciencedirect.com/science/article/pii/S089662731200092X

If you were to constrain a subset of units in a neural net to reproduce certain values for certain inputs (for example, the spike counts recorded from neurons in response to these images), and then reduce the error by constrained gradient descent, this might force the network to settle into a good local minimum.

Precise question:

What would be the most computationally efficient way to alter the weights of a neural network in the direction that maximizes the reduction in error, given that some neurons in the network must take certain predetermined values?

Progress thus far:

This seems to be a very difficult Lagrange multiplier problem. After doing some work on it and searching the existing literature on the topic, I was wondering whether anyone has heard of similar work.


Solution

Your best bet is the Kullback-Leibler (KL) divergence. It lets you specify the value you want your constrained neurons' activations to stay close to. In Python:

import numpy as np

def _binary_KL_divergence(p, p_hat):
    """
    Computes the KL divergence between two Bernoulli (binary) distributions
    with means p and p_hat, respectively.
    """
    # Element-wise penalty: p is the target activation, p_hat the observed one.
    return (p * np.log(p / p_hat)) + ((1 - p) * np.log((1 - p) / (1 - p_hat)))

where p is the constrained (target) value, and p_hat is the average activation of the unit over your samples. It is as simple as adding this term to the objective function. So, if the algorithm minimizes the squared error ||H(X) - y||^2, the new form would be ||H(X) - y||^2 + KL_divergence_term.
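For concreteness, here is a minimal sketch of how the penalty could be folded into the squared-error objective, reusing the _binary_KL_divergence function above. The idea that H returns both the predictions and the constrained units' activations, and the weighting coefficient beta, are illustrative assumptions, not something fixed by the answer.

def penalized_objective(H, X, y, p, beta):
    """
    Squared error plus the KL penalty on the constrained units.
    H(X) is assumed to return (predictions, activations), where
    activations has one column per constrained unit; p holds the
    target (recorded) activation of each constrained unit, and
    beta weighs the penalty.  All of these names are illustrative.
    """
    predictions, activations = H(X)
    p_hat = np.mean(activations, axis=0)   # average activation per constrained unit
    squared_error = np.sum((predictions - y) ** 2)
    kl_penalty = np.sum(_binary_KL_divergence(p, p_hat))
    return squared_error + beta * kl_penalty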

As part of the cost function, it penalizes average activations that diverge from p, whether higher or lower (Figure 1). How the weights update depends on the partial derivatives of the new objective function.

(Figure 1: KL-divergence cost when p = 0.2)
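For the weight updates, the only new piece is the derivative of the penalty with respect to the average activation; during backpropagation it is added (scaled by the penalty weight) to the error signal of each constrained unit. A minimal sketch, in the same notation:

def _binary_KL_divergence_grad(p, p_hat):
    # d/dp_hat of the penalty above: -p / p_hat + (1 - p) / (1 - p_hat).
    # During backprop, this term (times the penalty weight, e.g. beta) is
    # added to the delta of each constrained unit.
    return -p / p_hat + (1 - p) / (1 - p_hat)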

In fact, I borrowed this idea from sparse autoencoders; more details can be found in the Lecture Notes on Sparse Autoencoders.

Good luck!

Licensed under: CC-BY-SA with attribution