Question

I am implementing a feed-forward neural network that trains using backpropagation. When I output the error after each test case as the network learns, I notice that after a number of epochs it starts to learn certain test cases very well but others very badly, i.e. certain test cases have very low error while others have very high error.

Essentially, after a few epochs I notice that the Mean Squared Error stagnates to the following pattern- (each line represents MSE after a single testcase).

0.6666666657496451
0.6666666657514261
1.5039854423139616E-10
1.4871467103001578E-10
1.5192940136144856E-10
1.4951558809679557E-10
0.6666521719715195
1.514803547256445E-10
1.5231135866323182E-10
0.6666666657507451
1.539071732985272E-10

Could there be any possible reason(s) why this is happening ?

Initially I thought these cases causing high error rates could just be outliers - but there are too many of them as the pattern suggests. Could it be that my learner has just reached a local minima and needs some momentum to get out of it ?


Solution

My answer addresses a possible solution to the "uneven" progress in training your classifier. As to *why* you are seeing this behavior, I defer. In particular, I'm reluctant to attribute causes to artifacts observed mid-training: is it the data? The MLP implementation? The tunable configuration you selected? In fact, it is the interaction of your classifier with the data that produces this observation, rather than some inherent feature of either one.

It's not uncommon for a classifier to learn certain input vectors quite well and quite quickly--i.e., [observed - predicted]^2 becomes very small after only a few cycles/epochs--and for the same classifier to repeatedly fail (and fail to improve) on other input vectors.

To successfully complete the training of your Classifier, Boosting is the textbook answer for the problem described in your Question.

Before going any further though, a small mistake in your configuration/setup could also account for the behavior you observed.

In particular, perhaps verify these items in your config:

  • are your input vectors properly coded--e.g., so that their range is [-1, 1]?

  • have you correctly coded your response variables (i.e, 1-of-C coding)?

  • have you selected a reasonable initial learning rate and momentum term? And have you tried training with learning-rate values adjusted on either side of that initial value?
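
The first two checks can be sketched as follows; this is a minimal illustration assuming NumPy, and the function names (`scale_to_range`, `one_hot`) are my own, not from your implementation:

```python
import numpy as np

def scale_to_range(X, lo=-1.0, hi=1.0):
    """Linearly rescale each feature column of X into [lo, hi]."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return lo + (hi - lo) * (X - x_min) / (x_max - x_min)

def one_hot(labels, num_classes):
    """1-of-C coding: class k becomes a vector with a 1 in position k."""
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

X = np.array([[2.0, 10.0], [4.0, 30.0], [6.0, 50.0]])
print(scale_to_range(X))       # each feature column now spans [-1, 1]
print(one_hot([0, 2, 1], 3))   # one row per label, one column per class
```

If your raw features span very different ranges (say one in [0, 1] and another in [0, 1000]), skipping this step alone can produce the kind of lopsided per-vector error you are seeing.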

In any event, assuming those configuration and setup issues are in order, here are the relevant implementation details. Boosting (which is, strictly speaking, a technique in which multiple classifiers are combined) works like this:

After some number of epochs, examine the results (as you have been doing). *Data vectors that the classifier has failed to learn are assigned a weighting factor that increases their error* (some number greater than one); similarly, data vectors that the classifier learned well are also assigned a weighting factor, but here the value is less than one, so that the contribution of their training error is reduced.

So for instance, suppose at the end of the first epoch (one iteration through all data vectors comprising your training set) your total error is 100--in other words, the squared error (observed value - predicted value)^2 summed over all data vectors in the training set.

These are two MSE values from among those listed in your Question:

0.667        # poorly learned input vector => assign error multiplier > 1 
1.5e-10      # well-learned input vector => assign error multiplier < 1  

In Boosting, you would find the input vectors that correspond to these two error measurements and associate an error weight with each; this weight will be greater than one in the first case and less than one in the second. Let's suppose you assign error weights of 1.3 and 0.7, respectively. Further suppose that after the next epoch your classifier has not improved on the first of these two input vectors--i.e., it returns the same predicted values as it did in the previous epoch. For this epoch, however, that vector's contribution to total error is not 0.67 but 1.3 x 0.67, or approximately 0.87.

What is the effect of this increased error on training progress?

Larger error means a steeper gradient and therefore for the next iteration, a larger adjustment to the appropriate weights comprising the weight matrices--in other words, more rapid training focused at this particular input vector.

You might imagine that each of these data vectors has an implicit error weight of 1.0. Boosting just increases that error weight (for vectors that the classifier is unable to learn) and decreases this weight for vectors that it learns well.
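
That per-vector reweighting scheme can be sketched in a few lines; this is an illustrative toy, not AdaBoost's exact weight-update rule, and the threshold and multipliers (1e-3, 1.3, 0.7) are arbitrary values chosen to mirror the example above:

```python
import numpy as np

def update_error_weights(weights, per_vector_mse, threshold=1e-3,
                         up=1.3, down=0.7):
    """After an epoch, boost the error weight of poorly learned vectors
    and shrink the weight of well-learned ones."""
    new_w = weights.copy()
    new_w[per_vector_mse > threshold] *= up     # still failing: error counts more
    new_w[per_vector_mse <= threshold] *= down  # learned well: error counts less
    return new_w

# Mirroring the two MSE values quoted above:
w = np.ones(2)                       # every vector starts with implicit weight 1.0
mse = np.array([0.667, 1.5e-10])
w = update_error_weights(w, mse)     # weights become [1.3, 0.7]
weighted_error = w * mse             # 1.3 * 0.667 ≈ 0.87 now feeds the gradient
```

During the next epoch, each vector's error term is multiplied by its weight before backpropagation, so the stubborn vectors drive proportionally larger weight updates.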

What I have just described is a particular implementation called AdaBoost, which is probably the best-known implementation of Boosting. For guidance and even code for language-specific implementations, have a look at boosting.com (seriously). That site is no longer maintained, though, so here are a couple more excellent resources that I have relied on and can recommend highly. The first is an academic site in the form of an annotated bibliography (including links to the papers discussed there). The first paper listed on that site (with a link to the PDF), The boosting approach to machine learning: An overview, is an excellent overview and an efficient way to acquire a working knowledge of this family of techniques.

There is also an excellent video tutorial on Boosting and AdaBoost at videolectures.net

OTHER TIPS

Is it possible that your neural network is simply too small to classify the data correctly? If you have too few neurons and/or too few layers, there may be no weight configuration for which the network learns the classification.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow