Question

Why exactly does features being dependent on each other, features having high correlation with one another, mean that they would be redundant? Also, does PCA help get rid of redundant/irrelevant features or do we have to get rid of redundant/irrelevant features before running PCA on our dataset?


Solution

For the sake of training, highly correlated features offer little additional "value," because the state of one can always (or almost always) be used to determine the state of the other. In that case there's no reason to include both features, since having both will have little impact on the predictions: if A "on" implies B "off", and A "off" implies B "on", then every state can be represented by learning from either A or B alone. This is greatly simplified, but the same reasoning applies to other highly correlated features.
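As a minimal sketch of that idea (the data and feature names here are invented for illustration), a correlation matrix makes the redundancy visible:

    # Two perfectly (negatively) correlated binary features: knowing A
    # fully determines B, so a model gains nothing from keeping both.
    import pandas as pd

    df = pd.DataFrame({"A": [1, 0, 1, 1, 0, 0]})
    df["B"] = 1 - df["A"]          # B is always the opposite of A
    df["C"] = [3, 7, 2, 9, 4, 6]   # an unrelated feature for contrast

    print(df.corr())
    # A and B show a correlation of -1.0; C is only weakly related to either.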

PCA can help reduce the number of features, but if you've already identified redundant or highly correlated features that will be of little use in training, it probably makes sense to eliminate them right away. You can then use PCA, or other feature-importance metrics generated by training on your full dataset, to further optimize your training feature set.
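A hedged sketch of that workflow in Python, assuming scikit-learn and a 0.95 correlation threshold (both the threshold and the synthetic data are illustrative choices, not part of the original answer):

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(200, 5)),
                     columns=["f1", "f2", "f3", "f4", "f5"])
    X["f6"] = 0.99 * X["f1"] + rng.normal(scale=0.01, size=200)  # near-copy of f1

    # Drop one feature from every pair whose absolute correlation exceeds 0.95
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
    X_reduced = X.drop(columns=to_drop)  # f6 is dropped here

    # Then run PCA on what remains, keeping 95% of the variance
    pca = PCA(n_components=0.95)
    X_pca = pca.fit_transform(X_reduced)
    print(f"dropped: {to_drop}, components kept: {pca.n_components_}")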

Other tips

Redundant features can be features that are multicollinear (i.e., highly correlated), but more importantly they measure the same thing without making a unique contribution.

For instance, age and income might be highly correlated, but in some analyses they still have unique effects in your model and may have conceptual differences that you want to capture for interpretation. On the other hand, age and birth date are purely redundant in most use cases I can think of (though there are always exceptions, such as when season of birth is important).

Can PCA help reduce redundancy? Sure. It's one of at least dozens of techniques you could use for this.

One way to use PCA for feature selection is to look at the factor loadings on the principal components, determine which correlated variables are measuring the same principal component, and then pick the top one or few variables to represent that latent variable, eliminating highly correlated, non-distinct features.
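A sketch of that loading-based selection on a toy dataset (Iris here is just a stand-in; strictly speaking, loadings scale the rows of components_ by the square roots of the explained variances, but the ranking within each component is the same):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    data = load_iris()
    X = StandardScaler().fit_transform(data.data)
    pca = PCA(n_components=2).fit(X)

    # Rows of components_ are components, columns are the original features,
    # so each entry reflects how strongly a feature loads on that component.
    for i, component in enumerate(pca.components_):
        top = np.argsort(np.abs(component))[::-1][:2]
        names = [data.feature_names[j] for j in top]
        print(f"PC{i + 1}: strongest loadings from {names}")
    # Features loading heavily on the same component are likely measuring the
    # same latent variable; keep one representative and consider dropping the rest.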

Should you eliminate redundant features before PCA? If you're going to use the principal components for prediction rather than for feature elimination, then yes.

You can do one round of feature analysis involving PCA or other techniques and a second round to create latent variables for your model if you want to do both.

Some additional tools for feature selection:

  • Minimum Redundancy Maximum Relevance
  • Correlation Feature Selection
  • Canonical Correlation Analysis
  • Factor Analysis
  • Use of a covariance matrix
  • Singular Value Decomposition
  • Variance Inflation Factors (a short sketch follows this list)
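
As a short sketch of the last item, assuming statsmodels and synthetic data (a VIF far above the commonly cited 5-10 range flags a multicollinear candidate):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(1)
    df = pd.DataFrame({"x1": rng.normal(size=100),
                       "x2": rng.normal(size=100)})
    df["x3"] = df["x1"] + rng.normal(scale=0.05, size=100)  # nearly a copy of x1

    # Add an intercept column, as VIF assumes a regression with a constant
    X = sm.add_constant(df)
    for i, col in enumerate(X.columns):
        if col != "const":
            print(col, round(variance_inflation_factor(X.values, i), 1))
    # x1 and x3 show very large VIFs, flagging the near-duplicate pair.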
Licensed under: CC-BY-SA with attribution