Question

Recently I've been studying backpropagation networks and have worked through some exercises by hand. Afterwards I came up with a question (which maybe doesn't make sense): is there anything important in the following two different weight-update methods?

1. Incremental training: weights are updated immediately, once all the delta Wij's are known and before the next training vector is presented.

2. Batch training: the delta Wij's are computed and stored for each exemplar training vector, but they are not immediately used to update the weights. Weight updating is done at the end of a training epoch.

I've googled for a while but haven't found any results.


Solution

What you are referring to are the two modes of performing gradient descent learning. In batch mode, changes to the weight matrix are accumulated over an entire presentation of the training data set (one 'epoch'); in online (incremental) mode, the weights are updated after the presentation of each vector in the training set.
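To make the two schedules concrete, here is a minimal sketch on a toy one-weight model (the model, names, and numbers are mine, purely for illustration; a real backprop network would compute each delta with the chain rule):

```python
# Toy setup: one weight w, model y = w * x, squared error.
# This only contrasts the two update schedules; it is not a real network.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs; true w is 2
eta = 0.05                                   # learning rate

def delta_w(w, x, t):
    """Gradient-descent weight change for a single exemplar."""
    return -eta * (w * x - t) * x

# Online (incremental) training: update after every exemplar.
w_online = 0.0
for epoch in range(20):
    for x, t in data:
        w_online += delta_w(w_online, x, t)       # weights change mid-epoch

# Batch training: accumulate over the epoch, apply once at the end.
w_batch = 0.0
for epoch in range(20):
    total = sum(delta_w(w_batch, x, t) for x, t in data)  # w held fixed here
    w_batch += total                                      # one update per epoch

print(w_online, w_batch)  # both approach 2.0 at this small learning rate
```

The only difference between the two loops is when the deltas are applied; everything else is identical.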

I believe the consensus is that online training is superior because it converges much faster, while most studies report no apparent difference in accuracy. (See, e.g., Randall Wilson & Tony Martinez, "The General Inefficiency of Batch Training for Gradient Descent Learning," Neural Networks, 2003.)

The reason online training converges faster is that it can follow curves in the error surface within each epoch. The practical significance is that you can use a larger learning rate (and therefore converge in fewer passes through the training data).
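A rough illustration of the learning-rate point, using the same toy one-weight model as in the sketch above (the learning rate 0.15 is deliberately aggressive; the exact threshold at which batch training breaks down depends on the data):

```python
# Same toy model: y = w * x with squared error, now with a larger learning rate.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
eta = 0.15  # fine for online updates here, too large for the summed batch step

def delta_w(w, x, t):
    return -eta * (w * x - t) * x

w_online, w_batch = 0.0, 0.0
for epoch in range(10):
    for x, t in data:
        w_online += delta_w(w_online, x, t)
    w_batch += sum(delta_w(w_batch, x, t) for x, t in data)
    print(f"epoch {epoch}: online w = {w_online:8.4f}, batch w = {w_batch:8.2f}")

# Online settles near the true weight 2.0, while the summed batch step
# overshoots and oscillates with growing amplitude.
```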

Put another way, the accumulated weight change in batch training increases with the size of the training set. The result is that batch training takes large steps at each iteration and therefore overshoots local minima in the error-surface topology: the solver oscillates rather than converges.
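To spell out the scaling (notation mine: learning rate eta, per-exemplar error E_i, N training vectors), the summed batch step is

```latex
\Delta w_{\text{batch}} = -\eta \sum_{i=1}^{N} \nabla E_i(w)
```

so unless the per-exemplar gradients largely cancel, its magnitude grows roughly linearly with N. This is why many formulations divide by N and step along the mean gradient rather than the sum, which removes the dependence of the step size on the training-set size.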

Batch training is usually the 'default' (it is the version most often presented in ML textbooks, etc.), and there's nothing wrong with using it as long as it converges within your acceptable time limits. Again, the difference in performance (resolution, or classification accuracy) is small or negligible.

OTHER TIPS

Yes, there is a difference between these two methods. The deltas that get computed are a function of both the input vector and the current weights of the network. If you change the weights, the deltas computed from the next input vector will be different from what they would have been had you left the weights unchanged.

So, for the very first input vector, the same deltas get computed regardless of the method you choose. After that, under the incremental method the weights in the network change, while under the batch method they remain the same for now. When the second input vector is presented, the two methods therefore produce different deltas, since the weights now differ between the two networks.
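A tiny sketch of this (a toy one-weight model of my own choosing, where the 'delta' is the gradient-descent weight change for one exemplar):

```python
# Toy model y = w * x with squared error, showing when the two
# methods' deltas agree and when they start to differ.
eta = 0.1

def delta_w(w, x, t):
    return -eta * (w * x - t) * x

w0 = 0.5
x1, t1 = 1.0, 2.0   # first training vector
x2, t2 = 2.0, 4.0   # second training vector

# First vector: both methods still have identical weights,
# so they compute the identical delta.
d1 = delta_w(w0, x1, t1)

# Second vector: incremental training has already applied d1,
# batch training has not, so the deltas now differ.
d2_incremental = delta_w(w0 + d1, x2, t2)   # 0.54
d2_batch       = delta_w(w0,      x2, t2)   # 0.60
print(d1, d2_incremental, d2_batch)
```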

Licensed under: CC-BY-SA with attribution