Question

Is it always a good idea to remove features that have high mutual information with each other and to remove features that have very low mutual information with the target variable? Why or why not?

Solution

Doing that is a very good idea. The problem is that doing it well is very hard. Feature selection is an NP-complete problem; in practical terms, this means we do not know of any fast algorithm that is guaranteed to select only the features you actually need.

In the other direction, omitting features that have little mutual information (MI) with the target might cause you to throw away the features you need most. There are cases in which a single feature is useless on its own, but becomes important once other features are available.

Consider a concept that is the XOR of some features. Given all of the features, the concept is completely predictable; given only one of them, you have zero MI with the target.
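
Here is a minimal sketch of that effect, assuming scikit-learn and NumPy are available (the random data and the use of `mutual_info_score` are purely illustrative):

```python
# Each feature alone carries ~0 information about an XOR target,
# while the pair of features determines it completely.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=10_000)
x2 = rng.integers(0, 2, size=10_000)
y = x1 ^ x2  # target is the XOR of the two features

print(mutual_info_score(x1, y))           # ~0 nats: x1 alone tells us nothing
print(mutual_info_score(x2, y))           # ~0 nats: x2 alone tells us nothing
print(mutual_info_score(2 * x1 + x2, y))  # ~0.69 nats (= 1 bit): the pair determines y
```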

A more real-life example is age at death. Birth date and death date together give you the age exactly, yet one of them on its own will have very low correlation with it (due to the increase in life expectancy).

In practice, omitting features with low MI is usually OK. Many learning algorithms rely on MI internally, so they would not be able to exploit the omitted variables anyway. As for the selection itself, there are many algorithms, usually heuristics or approximation algorithms, that are quite handy.
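
As one illustration (not the only way to do it), here is a rough sketch of the common filter heuristic: rank features by their estimated MI with the target and keep the top k. It assumes scikit-learn; the synthetic dataset and the choice of k are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Keep the 5 features with the highest estimated MI with the target.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                     # estimated MI of each feature with the target
print(selector.get_support(indices=True))   # indices of the features that were kept
```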

OTHER TIPS

As with many things, it depends. The specifics of how your variables relate to the domain they describe will dictate the answer, and even then the relationships may not be intuitive. Seemingly disparate features can have a significant effect when combined, which is the idea behind feature extraction.

Automated feature engineering techniques can help you decide which features are significant if you have the time and resources available, particularly when it comes to testing the impact of combined features. Additionally, some methods have the benefit of embedded feature selection, in which the algorithm itself tends to diminish the effect of insignificant variables, e.g. lasso regression, regularized decision trees, and random forests.
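
For instance, here is a small sketch of embedded selection with the lasso, assuming scikit-learn; the synthetic data and the alpha value are arbitrary choices:

```python
# L1 regularisation drives the coefficients of uninformative features
# to exactly zero, so the surviving features are "selected" as a side effect.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

model = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(model.coef_)   # features with non-zero weight survive

print(model.coef_)
print("selected feature indices:", kept)
```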

Here's a nice primer: http://machinelearningmastery.com/an-introduction-to-feature-selection/

The fact that a feature is redundant in the presence of another one, or is not informative enough on its own to describe the target variable, is not necessarily a sign that the feature is useless.

Indeed, such a feature may be extremely informative when combined with another one, despite not being very useful when considered in isolation.

Therefore, when applying feature selection methods, you should also consider combinations of features.

However, as pointed out by another answer to this question, finding the best combination of features is an NP-complete problem. Therefore, applying feature selection to individual features may be a good approximation. Still, I would rather apply a greedy approach (see for instance https://studentnet.cs.manchester.ac.uk/pgt/COMP61011/goodProjects/Shardlow.pdf for more information on the topic).
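
For illustration, here is a hand-rolled sketch of greedy forward selection (not the exact method from the linked report); the classifier, dataset, and stopping rule are all arbitrary choices:

```python
# At each step, add the single feature that most improves the cross-validated
# score; stop when no remaining candidate improves it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=15, n_informative=4, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining:
    scores = {
        f: cross_val_score(LogisticRegression(max_iter=1000),
                           X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:   # no candidate improves the score: stop
        break
    best_score = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)

print("selected features:", selected, "cv accuracy:", round(best_score, 3))
```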

EDIT to answer OP's comment:

a) The table below displays an extreme example of a feature that by itself is very informative, but in combination with others is totally redundant (feature_2). This is a regression problem in which we are trying to build a model to predict the "output" variable from "feature_1" and "feature_2".

| feature_1 | feature_2 | output |
|-----------|-----------|--------|
|         1 |         1 |    0.1 |      
|         2 |         2 |    0.2 |    
|         3 |         3 |    0.3 |     
|         4 |         4 |    0.4 |      
|         5 |         5 |    0.5 |    
|         6 |         6 |    0.6 |

b) The table below shows an extreme example of a feature (feature_2) that may not be very informative by itself, but that is very informative when combined with another one.

| feature_1 | feature_2 | output |
|-----------|-----------|--------|
|         1 |         1 |    0.1 |      
|         2 |         2 |   0.25 |    
|         3 |         1 |    0.3 |     
|         4 |         2 |   0.45 |      
|         5 |         1 |    0.5 |    
|         6 |         2 |   0.65 |   
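
A quick way to check this, assuming scikit-learn, is to compare the fit of a simple linear model using feature_1 alone, feature_2 alone, and both together:

```python
# In example (b), feature_2 looks weak on its own, but together with
# feature_1 it makes the output exactly predictable.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [2, 2], [3, 1], [4, 2], [5, 1], [6, 2]], dtype=float)
y = np.array([0.1, 0.25, 0.3, 0.45, 0.5, 0.65])

r2_f1_only = LinearRegression().fit(X[:, [0]], y).score(X[:, [0]], y)
r2_f2_only = LinearRegression().fit(X[:, [1]], y).score(X[:, [1]], y)
r2_both = LinearRegression().fit(X, y).score(X, y)

print(r2_f1_only)  # high, but not perfect
print(r2_f2_only)  # low: feature_2 alone explains little of the output
print(r2_both)     # ~1.0: the two features together explain the output exactly
```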
Licensed under: CC-BY-SA with attribution