Question

I'm trying to perform leave-one-out cross-validation for modelling a particular problem using a back-propagation neural network. I have 8 features in my training data and 20 instances. I'm trying to make the NN learn a function to build a prediction model. The problem is that the prediction error rate is quite high. My guess is that the number of training instances is small compared to the number of features under consideration. Is this conclusion correct? Is there an optimal feature-to-instance ratio?
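
For concreteness, the setup described above might look like the sketch below. This is not the asker's actual code: it assumes scikit-learn's MLPRegressor as a stand-in for the back-propagation network, and synthetic data as a stand-in for the real 20 x 8 training set.

    # Minimal sketch (assumed setup, not the original code): leave-one-out
    # cross-validation of a small neural network on a 20 x 8 data set.
    import numpy as np
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 8))                  # 20 instances, 8 features
    y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=20)

    nn = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
    scores = cross_val_score(nn, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    print("LOO mean squared error:", -scores.mean())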

Solution

(This topic is often phrased in the ML literature as the acceptable size or shape of the data set, given that a data set is often described as an m x n matrix in which m is the number of rows (data points) and n is the number of columns (features); obviously m >> n is preferred.)

In any event, I am not aware of a general rule for an acceptable ratio of features to observations; there are probably a couple of reasons for this:

  • such a ratio would depend strongly on the quality of the data (signal-to-noise ratio); and

  • the number of features is just one element of model complexity (e.g., interactions among the features), and model complexity is the strongest determinant of the number of data instances (data points) required.


So there are two sets of approaches to this problem--which, because they attack it from opposite directions, can both be applied to the same model:

  • reduce the number of features; or

  • use a statistical technique to leverage the data that you do have.

A couple of suggestions, one for each of the two paths above:

  1. Eliminate "non-important" features--i.e., those features that don't contribute to the variability in your response variable. Principal Component Analysis (PCA) is a fast and reliable way to do this, though there are a number of other techniques which are generally subsumed under the rubric "dimension reduction." (A short PCA sketch follows this list.)

  2. Use Bootstrap methods instead of cross-validation. The difference in methodology seems slight, but the (often substantial) improvement in reducing prediction error is well documented for multi-layer perceptrons (neural networks); see, e.g., Efron, B. and Tibshirani, R., "Improvements on Cross-Validation: The .632+ Bootstrap Method," Journal of the American Statistical Association, 92, 548-560, 1997. If you are not familiar with Bootstrap methods for splitting training and testing data, the general technique is similar to cross-validation except that instead of partitioning the data into disjoint folds, you repeatedly draw samples with replacement from the full data set and test on the points left out of each resample. Section 7.11 of Elements is a good introduction to Bootstrap methods. (A bootstrap sketch also follows this list.)
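
A minimal sketch of the first suggestion, assuming scikit-learn (the 20 x 8 matrix X is a synthetic stand-in for the data set from the question). One caveat worth noting: PCA is unsupervised, so it ranks components by variance in the features rather than by their relationship to the response; the 95% threshold below is an illustrative choice, not part of the original answer.

    # Dimension reduction with PCA: keep just enough principal components
    # to explain 95% of the variance in the features.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 8))                  # stand-in for the real data

    pca = PCA(n_components=0.95)                  # float = variance fraction
    X_reduced = pca.fit_transform(X)
    print("kept", X_reduced.shape[1], "of", X.shape[1], "features")
    print("explained variance ratios:", pca.explained_variance_ratio_)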
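And a minimal sketch of the second suggestion, a hand-rolled leave-one-out bootstrap: each round trains on a resample drawn with replacement and evaluates on the out-of-bag rows. X and y are the same synthetic stand-ins as in the sketch under the question; the number of rounds is an illustrative choice.

    # Bootstrap estimate of prediction error: train on a resample drawn
    # with replacement, test on the out-of-bag rows, and average.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 8))
    y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=20)

    n, errors = len(X), []
    for _ in range(100):                          # 100 bootstrap rounds
        idx = rng.integers(0, n, size=n)          # resample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # rows left out of the resample
        if oob.size == 0:
            continue
        nn = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000,
                          random_state=0).fit(X[idx], y[idx])
        errors.append(np.mean((nn.predict(X[oob]) - y[oob]) ** 2))
    print("bootstrap (out-of-bag) MSE estimate:", np.mean(errors))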

The best single source on this general topic that I have found is Chapter 7, Model Assessment and Selection, of the excellent treatise Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. The book is available as a free download from its homepage.
