Association between features
06-11-2019
Question
Given the anonymized dataset of features below, where:
- "code" is a categorical variable.
- "x1" and "x2" are continuous variables.
- "x3" and "x4" are extracted features: the mean values of "x1" and "x2", respectively, computed per code.
       code  x1  x2  x3  x4
    0   100   1   2   2   4
    1   100   2   4   2   4
    2   100   3   6   2   4
    3   200   4   8   5  10
    4   200   5  10   5  10
    5   200   6  12   5  10
    6   300   7  14   8  16
    7   300   8  16   8  16
    8   300   9  18   8  16
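As a sanity check on the dataset's construction, the following sketch (assuming pandas is available) rebuilds the table and confirms that x3 and x4 are exactly the per-code group means of x1 and x2:

```python
import pandas as pd

# Rebuild the dataset from the question.
df = pd.DataFrame({
    "code": [100, 100, 100, 200, 200, 200, 300, 300, 300],
    "x1":   [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "x2":   [2, 4, 6, 8, 10, 12, 14, 16, 18],
    "x3":   [2, 2, 2, 5, 5, 5, 8, 8, 8],
    "x4":   [4, 4, 4, 10, 10, 10, 16, 16, 16],
})

# x3 and x4 are exactly the per-code means of x1 and x2.
assert (df.groupby("code")["x1"].transform("mean") == df["x3"]).all()
assert (df.groupby("code")["x2"].transform("mean") == df["x4"]).all()
```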
Looking at the columns, x3 and x4 are constant within each code: when x3 is 2, 5, or 8, the code is 100, 200, or 300 respectively, and likewise when x4 is 4, 10, or 16.
Intuitively, keeping all of these columns without dropping any would introduce redundant features when training a model. My question is: how valid is this hypothesis? I'm not confident about it. Does redundancy actually matter when training a model? Does it depend on the model type (tree-based or otherwise)?
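One way to make the suspected redundancy concrete (a sketch, not a full diagnostic) is to look at pairwise correlations and at how many distinct values each derived feature takes per code. In this particular toy dataset, x2 happens to be exactly 2·x1 and x4 exactly 2·x3, so those pairs are perfectly correlated, and x3/x4 each take a single value per code, i.e. they carry no information beyond the code column itself:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "code": [100, 100, 100, 200, 200, 200, 300, 300, 300],
    "x1":   [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "x2":   [2, 4, 6, 8, 10, 12, 14, 16, 18],
    "x3":   [2, 2, 2, 5, 5, 5, 8, 8, 8],
    "x4":   [4, 4, 4, 10, 10, 10, 16, 16, 16],
})

# Pairwise Pearson correlations of the continuous columns.
corr = df[["x1", "x2", "x3", "x4"]].corr()
print(corr.round(3))

# x3 and x4 are perfectly correlated (x4 == 2 * x3 in this dataset).
assert np.isclose(corr.loc["x3", "x4"], 1.0)

# Each of x3 and x4 takes exactly one value per code: they are
# deterministic functions of the categorical column.
assert (df.groupby("code")["x3"].nunique() == 1).all()
assert (df.groupby("code")["x4"].nunique() == 1).all()
```

Whether this redundancy hurts depends on the model: for tree-based models it mainly wastes split candidates, whereas for linear models perfect collinearity can destabilize coefficient estimates.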
No correct solution
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange