Question


I'm using a feed-forward neural network in Python, via the PyBrain implementation. For training, I'll be using the back-propagation algorithm. I know that with neural networks we need just the right amount of data in order not to under- or over-train the network. I can get about 1200 different training templates for the datasets.
So here's the question:
How do I calculate the optimal amount of data for my training?

Since I've tried with 500 items in the dataset and it took many hours to converge, I would prefer not to have to try too many sizes. The results were quite good with this last size, but I would like to find the optimal amount. The neural network has about 7 inputs, 3 hidden nodes and one output.
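For reference, a minimal sketch of that setup in PyBrain (the layer sizes come from the question; the use of buildNetwork, SupervisedDataSet and BackpropTrainer is just the usual way to wire this up, not necessarily the asker's exact code):

```python
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

# 7 inputs, 3 hidden units, 1 output, as described in the question
net = buildNetwork(7, 3, 1)

# Supervised dataset with matching input/target dimensions
ds = SupervisedDataSet(7, 1)
# ds.addSample(input_vector, target) for each training template

trainer = BackpropTrainer(net, ds)
```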


Solution

How do I calculate the optimal amount of data for my training?

It's completely solution-dependent; there's also a bit of art mixed in with the science. The only way to know whether you're in overfitting territory is to regularly test your network against a set of validation data (that is, data you do not train with). When performance on that set begins to drop, you've probably trained too far -- roll back to the last iteration.
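One way to put that into practice with PyBrain is a simple early-stopping loop along these lines (the 75/25 split, the patience of 10 epochs, and the epoch cap are arbitrary sketch values, not recommendations):

```python
from copy import deepcopy
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

ds = SupervisedDataSet(7, 1)
# ... populate ds with ds.addSample(input_vector, target) ...

train_ds, valid_ds = ds.splitWithProportion(0.75)  # hold back 25% as validation data
net = buildNetwork(7, 3, 1)
trainer = BackpropTrainer(net, train_ds)

best_error, best_net = float('inf'), None
patience, bad_epochs = 10, 0

for epoch in range(1000):
    trainer.train()                           # one epoch of backprop on the training split
    val_error = trainer.testOnData(valid_ds)  # error on data the network never trained on
    if val_error < best_error:
        best_error, best_net = val_error, deepcopy(net)  # remember the best weights so far
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # validation error has stopped improving
            break

net = best_net  # "roll back" to the network from before validation performance dropped
```

If you'd rather not hand-roll the loop, PyBrain's BackpropTrainer also has a trainUntilConvergence method that does something similar internally: it splits off a validation portion of the dataset and stops once the validation error has stopped improving for a while.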

The results were quite good with this last size but I would like to find the optimal amount.

"Optimal" isn't necessarily possible; it also depends on your definition. What you're generally looking for is a high degree of confidence that a given set of weights will perform "well" on unseen data. That's the idea behind a validation set.

OTHER TIPS

The diversity of the dataset is much more important than the quantity of samples you are feeding to the network.

You should customize your dataset to include and reinforce the data you want the network to learn.

After you have crafted this custom dataset, you can start playing with the number of samples, as it is completely dependent on your problem.

For example: if you are building a neural network to detect the peaks of a particular signal, it would be completely useless to train your network with a zillion samples of signals that do not have peaks. Therein lies the importance of customizing your training dataset, no matter how many samples you have.
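As a very rough sketch of that kind of curation (signal_windows, the feature size of 7, and the 1.0/0.0 labels are all hypothetical, just to show the balancing idea):

```python
from pybrain.datasets import SupervisedDataSet

# signal_windows: hypothetical list of (features, label) pairs,
# where label is 1.0 if the window contains a peak and 0.0 otherwise.
with_peaks = [(x, y) for x, y in signal_windows if y == 1.0]
without_peaks = [(x, y) for x, y in signal_windows if y == 0.0]

# Keep the two cases roughly balanced instead of drowning the
# network in peak-free examples.
n = min(len(with_peaks), len(without_peaks))

ds = SupervisedDataSet(7, 1)
for x, y in with_peaks[:n] + without_peaks[:n]:
    ds.addSample(x, (y,))
```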

Technically speaking, in the general case, and assuming all examples are correct, then more examples are always better. The question really is, what is the marginal improvement (first derivative of answer quality)?

You can test this by training it with 10 examples, checking quality (say 95%), then 20, and so on, to get a table like:

examples  quality
10        95%
20        96%
30        96.5%
40        96.55%
50        96.56%

You can then clearly see your marginal gains and make your decision accordingly. A rough version of that measurement loop is sketched below.
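A minimal sketch, assuming your data lives in a hypothetical all_samples list of (input_vector, target) pairs; the sizes, the fixed hold-out of the last 300 samples, and the epoch cap are arbitrary choices, and it reports validation MSE rather than a percentage, so swap in whatever quality measure you prefer:

```python
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

# all_samples: hypothetical list of (input_vector, target) pairs, e.g. your 1200 templates.
# Keep the same validation set fixed across runs so the numbers are comparable.
valid_ds = SupervisedDataSet(7, 1)
for x, y in all_samples[-300:]:
    valid_ds.addSample(x, (y,))

for size in (10, 20, 30, 40, 50, 100, 200, 400):
    train_ds = SupervisedDataSet(7, 1)
    for x, y in all_samples[:size]:
        train_ds.addSample(x, (y,))

    net = buildNetwork(7, 3, 1)            # fresh network for each training-set size
    trainer = BackpropTrainer(net, train_ds)
    trainer.trainUntilConvergence(maxEpochs=200)

    # Validation error as a function of training-set size; the drop-off in
    # improvement from one row to the next is your marginal gain.
    print(size, trainer.testOnData(valid_ds))
```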

Licensed under: CC-BY-SA with attribution