Question

For one of my assignments in my AI class we were tasked with creating a perceptron learning implementation of the Widrow-Hoff delta rule. I've coded this implementation in Java.

The following github link contains the project: https://github.com/dmcquillan314/CS440-Homework/tree/master/CS440-HW2-1

The issue that I'm having is not with the creation of the perceptron. That is working fine.

In the project after training the perceptron I then applied an unclassified dataset to the perceptron to then learn the classifications of each input vector. This also worked fine.

My issue pertains to learning which feature of the inputs is the most important.

For example, suppose the features within each input vector were color, car model, and car make, and we wanted to determine which feature is the most important. How would one go about doing so?

My original understanding led me to believe the answer was calculating the correlation coefficient between the value of that feature across all inputs and the classification vector that is produced. However, this turned out to be a false assumption.

Is there some other way that the most important feature can be learned?

EDIT

Sample weight vector:

( -752, 4771, 17714, 762, 6, 676, 3060, -2004, 5459, 9591.299, 3832, 14963, 20912 )

Sample input vectors:

(55, 1, 2, 130, 262, 0, 0, 155, 0, 0, 1, 0, 3, 0)

(59, 1, 3, 126, 218, 1, 0, 134, 0, 2.2, 2, 1, 6, 1)

(45, 1, 2, 128, 308, 0, 2, 170, 0, 0, 1, 0, 3, 0)

(59, 1, 4, 110, 239, 0, 2, 142, 1, 1.2, 2, 1, 7, 1)

The last element is the classification.

I will post an answer here when I find one. So far I believe that the answer given by the instructor is inaccurate.

Was this helpful?

Solution 2

This turned out to be a lot simpler than I originally thought. The answer/process is as follows:

Given a set of input vectors such as the following:

[1,0,1,0], [0,1,0,1]

The data is already constrained between 0 and 1 to minimize the variance. However, in the case of my data I have something more like the following:

[0,145,0,132],[0,176,0,140]

This causes the variance of some input features to be much larger than others, so the weight vector cannot be used as an indicator of feature importance. Therefore, for the weight vector to be an indicator of importance, we must first normalize the data by dividing each feature by its maximum.

For the above set, the vector of per-feature maxima would be: [0,176,0,140]

This results in a set of uniformly scaled feature vectors, and the weight vector then becomes an indicator of feature importance.
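The normalization step above can be sketched as follows. This is a minimal illustration, not code from the linked project; the class and method names are my own.

```java
import java.util.Arrays;

// Sketch: scale each feature by its column maximum so that learned
// perceptron weights become comparable across features.
public class FeatureScaling {

    // Divide every feature value by the maximum absolute value observed for
    // that feature across the dataset (columns whose max is 0 are left as-is).
    public static double[][] normalizeByMax(double[][] data) {
        int rows = data.length, cols = data[0].length;
        double[] max = new double[cols];
        for (double[] row : data)
            for (int j = 0; j < cols; j++)
                max[j] = Math.max(max[j], Math.abs(row[j]));
        double[][] scaled = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                scaled[i][j] = (max[j] == 0) ? data[i][j] : data[i][j] / max[j];
        return scaled;
    }

    public static void main(String[] args) {
        // The example vectors from above
        double[][] data = { {0, 145, 0, 132}, {0, 176, 0, 140} };
        double[][] scaled = normalizeByMax(data);
        // Every feature now lies in [0, 1]; e.g. scaled[0][1] == 145.0 / 176.0
        System.out.println(Arrays.deepToString(scaled));
    }
}
```

Note the guard for all-zero columns: dividing by a zero maximum would produce NaN, so such columns are passed through unchanged.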

Other tips

The importance of a feature is captured by computing how much the learned model depends on a feature f.

A perceptron is a simple feed-forward neural network, and for a neural network (which is a real-valued nonlinear function), dependency corresponds to the partial derivative of the output function with respect to f.

The relative importance of a feature is proportional to its average absolute weight on a trained perceptron. This is not always true for neural networks in general. For instance, this need not hold true for multi-layer perceptrons.

For more details (typing the exact formula here will be a notational mess), look at sections 2 and 3 of this paper. I believe equation (8) (in section 3) is what you are looking for.

There, the score is a summation over multiple learners. If yours is a single-layer perceptron, the function learned is a single weight vector:

w = (w1, w2, ... wn)

Then, the average absolute weight I mention at the beginning is simply the absolute weight |wi| of the i-th feature. This seems too simple a measure to be ranking the importance of features, right? But ... if you think about it, an n-dimensional input x gets transformed to w . x (the vector dot product). That is, the i-th weight wi fully controls how much the input changes along one dimension of the vector space.
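Ranking features by |wi| is a one-liner in spirit; here is a small self-contained sketch (names are my own, assuming a trained single-layer perceptron's weight vector, such as the sample one above):

```java
import java.util.stream.IntStream;

// Sketch: rank the features of a trained single-layer perceptron
// by the absolute value of their weights.
public class WeightRanking {

    // Returns feature indices sorted from most to least important,
    // i.e. by descending |w_i|.
    public static int[] rankByAbsWeight(double[] w) {
        return IntStream.range(0, w.length)
                .boxed()
                .sorted((a, b) -> Double.compare(Math.abs(w[b]), Math.abs(w[a])))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        // First four weights from the sample weight vector above
        double[] w = { -752, 4771, 17714, 762 };
        int[] order = rankByAbsWeight(w);
        // Most important first: index 2 (|17714|), then 1, 3, 0
        System.out.println(java.util.Arrays.toString(order));
    }
}
```

Remember that this ranking is only meaningful if the inputs were normalized to a common scale before training, as discussed in the accepted answer.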

By the way, in most (if not all) classifiers, the feature weight is itself the measure of its importance. It's just that the weights are computed in more complicated ways for most other classifiers.

Because perceptron learning, and especially a multi-layer perceptron network, is a black-box model whose weights and activations are each influenced by many or all of the features, there is no direct way to extract feature importance from it, whereas this is easy for tree-based models. However, we can use the permutation importance method introduced here: https://towardsdatascience.com/feature-importance-with-neural-network-346eb6205743
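The idea of permutation importance is model-agnostic: shuffle one feature column at a time and measure how much the model's accuracy drops. Below is a hedged sketch of that idea; the toy model and names are my own, not code from the linked article.

```java
import java.util.Random;
import java.util.function.Function;

// Sketch of permutation importance: importance of feature j is the baseline
// accuracy minus the accuracy after randomly shuffling column j.
// (In practice one averages over several shuffles.)
public class PermutationImportance {

    public static double accuracy(Function<double[], Integer> predict,
                                  double[][] x, int[] y) {
        int correct = 0;
        for (int i = 0; i < x.length; i++)
            if (predict.apply(x[i]) == y[i]) correct++;
        return (double) correct / x.length;
    }

    public static double importance(Function<double[], Integer> predict,
                                    double[][] x, int[] y, int j, Random rng) {
        double baseline = accuracy(predict, x, y);
        double[][] permuted = new double[x.length][];
        for (int i = 0; i < x.length; i++) permuted[i] = x[i].clone();
        // Fisher-Yates shuffle applied only to column j
        for (int i = permuted.length - 1; i > 0; i--) {
            int k = rng.nextInt(i + 1);
            double tmp = permuted[i][j];
            permuted[i][j] = permuted[k][j];
            permuted[k][j] = tmp;
        }
        return baseline - accuracy(predict, permuted, y);
    }

    public static void main(String[] args) {
        // Toy model: predict 1 iff feature 0 > 0.5; feature 1 is irrelevant
        Function<double[], Integer> model = v -> v[0] > 0.5 ? 1 : 0;
        double[][] x = { {0.9, 0.1}, {0.8, 0.9}, {0.1, 0.2}, {0.2, 0.8} };
        int[] y = { 1, 1, 0, 0 };
        Random rng = new Random(42);
        System.out.println("feature 0: " + importance(model, x, y, 0, rng));
        // Shuffling the irrelevant feature never changes predictions,
        // so its importance is exactly 0
        System.out.println("feature 1: " + importance(model, x, y, 1, rng));
    }
}
```

A drop in accuracy after shuffling a feature means the model was relying on it; a near-zero drop means the feature can likely be discarded.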

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow