Question

I have a dataset containing, among others, two feature variables that are derived from the same underlying data (i.e., they share mutual information), but convey different information/messages. How should I handle such cases?

Since, logically, they will be highly correlated, it would make sense to use only one of them, preferably the one that conveys more information. But:

  1. Is this the correct approach, or do we actually lose valuable information by not including both?
  2. If including both is the correct approach, is there anything else that needs to be done and/or checked to prevent messing up the model (since they will be highly correlated)?

Example 1:

  • Let's say we have a feature which can be a pair of any numbers from 1 to 3, e.g. (1,1), (3,2), (2,1), etc.
  • And we also have another feature which tells us how many ones (i.e., the value 1) appear in the first feature; for the previous cases this would be 2, 0, and 1, respectively.
  • Although this second feature does not per se provide any new information not already present in the first feature (i.e., it can be deduced from the first feature), it does have a special meaning: let's say the number of ones is expected to influence the result (the dependent variable).
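To make Example 1 concrete, here is a minimal sketch (with the three example pairs above) of how the derived count-of-ones feature relates to the original pair feature:

```python
# Sketch of Example 1: pairs of values from 1 to 3,
# plus a derived feature counting how many ones each pair contains.
pairs = [(1, 1), (3, 2), (2, 1)]

# Derived feature: number of ones in each pair (deducible from the pair itself)
num_ones = [p.count(1) for p in pairs]

print(num_ones)  # [2, 0, 1]
```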

Example 2:

  • One variable is a discrete/integer value, and the other is 0 if the first feature is below some specific value, and 1 if it is equal or higher.
  • Just as in Example 1, the second feature has some special meaning.
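A minimal sketch of Example 2, with an assumed cutoff of 5 chosen purely for illustration:

```python
# Sketch of Example 2: an integer feature plus a derived indicator
# that is 0 below a chosen threshold and 1 otherwise.
threshold = 5  # assumed cutoff, for illustration only
values = [2, 7, 5, 3]

indicator = [int(v >= threshold) for v in values]
print(indicator)  # [0, 1, 1, 0]
```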

Solution

For predictive power, in general, including both shouldn't be a problem. But there is a lot of nuance here.

Foremost, if predictive power isn't all you care about: if you're making statistical inferences, or care about explainability and feature importances, then including both can cause issues. Briefly, your model may split the importance of the underlying variable across all the derived ones.
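The importance-splitting effect can be seen in a minimal sketch with synthetic data: when a target driven by a single variable is fit against two perfectly correlated copies of that variable, a minimum-norm least-squares fit divides the one true coefficient between them.

```python
import numpy as np

# Synthetic sketch: target y depends on x alone (true coefficient 1),
# but we include a perfectly correlated duplicate column as a second feature.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, x])   # two identical, perfectly correlated columns
y = 1.0 * x

# Minimum-norm least squares splits the single true effect across
# both columns rather than attributing it to one of them.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # ≈ [0.5, 0.5]
```

Each column appears to carry only half the effect, which is exactly the kind of distortion that matters for inference and feature-importance readings.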

In some cases, it might not help at all: a tree model on your second example can already easily discover the derived variable given only the original. It may be actively harmful: adding too many of these derived variables might provide noise for your model to overfit to, rather than useful signal.

In some cases, it might help a lot: in your first example, a linear classifier cannot recover the derived feature from the original at all, and a tree model would require several consecutive splits to reconstruct it. A neural network could build it, but it's not clear whether the training process would find it.
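The first example's limitation for linear models can be checked directly: over all nine pairs from {1, 2, 3}, no linear function of the raw pair reproduces the count of ones, while adding the derived count as a feature makes the fit exact. A small sketch on this assumed enumeration:

```python
import numpy as np
from itertools import product

# Sketch of Example 1: all 9 pairs over {1, 2, 3}, with the target
# equal to the number of ones in each pair.
pairs = np.array(list(product([1, 2, 3], repeat=2)), dtype=float)
y = (pairs == 1).sum(axis=1).astype(float)

def max_fit_error(X):
    # least-squares linear fit with an intercept; return the worst residual
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.abs(A @ coef - y).max()

raw_err = max_fit_error(pairs)                        # raw pair only
aug_err = max_fit_error(np.column_stack([pairs, y]))  # with the derived count

print(raw_err > 0.1, aug_err < 1e-9)  # True True
```

The raw-only fit always leaves a sizeable residual (the count of ones is not a linear function of the pair values), whereas the augmented fit is exact.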

Licensed under: CC-BY-SA with attribution