Question

Given the anonymized dataset of features below, where:

  1. "code" is a categorical variable.
  2. "x1" and "x2" are continuous variables.
  3. "x3" and "x4" are derived features: for each code, "x3" is the mean of "x1" and "x4" is the mean of "x2" within that code group.
       code  x1  x2  x3  x4
    0   100   1   2   2   4
    1   100   2   4   2   4
    2   100   3   6   2   4
    3   200   4   8   5  10
    4   200   5  10   5  10
    5   200   6  12   5  10
    6   300   7  14   8  16
    7   300   8  16   8  16
    8   300   9  18   8  16

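For reference, here is a minimal sketch of how a dataset like the one above could be built with pandas (the column names match the table; the construction itself is my assumption about how such features are typically extracted):

```python
import pandas as pd

# Reconstruct the dataset shown above
df = pd.DataFrame({
    "code": [100] * 3 + [200] * 3 + [300] * 3,
    "x1": range(1, 10),
    "x2": range(2, 20, 2),
})

# x3 and x4 are the per-code means of x1 and x2, broadcast back to each row
df["x3"] = df.groupby("code")["x1"].transform("mean")
df["x4"] = df.groupby("code")["x2"].transform("mean")
print(df)
```

`groupby(...).transform("mean")` returns a value per row (the group mean), which reproduces the x3 and x4 columns in the table.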
Looking at the columns, x3 and x4 each take a single value per code: x3 is 2, 5, or 8 exactly when the code is 100, 200, or 300, and likewise x4 is 4, 10, or 16 for codes 100, 200, and 300. In other words, code, x3, and x4 are in one-to-one correspondence.
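This one-to-one correspondence is easy to confirm programmatically; a small check (using the values from the table above) could look like:

```python
import pandas as pd

# The code, x3, and x4 columns exactly as they appear in the table
df = pd.DataFrame({
    "code": [100, 100, 100, 200, 200, 200, 300, 300, 300],
    "x3":   [2, 2, 2, 5, 5, 5, 8, 8, 8],
    "x4":   [4, 4, 4, 10, 10, 10, 16, 16, 16],
})

# Each code maps to exactly one distinct x3 value and one distinct x4 value
print(df.groupby("code")[["x3", "x4"]].nunique())
```

If every entry of the resulting table is 1, each code determines x3 and x4 completely.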

Intuitively, keeping all of these columns without dropping any would introduce redundant features when training a model. My question is: how sound is this hypothesis? I'm not confident about it. Does the redundancy actually matter when training a model, and does the answer depend on the model type (tree-based or otherwise)?


Licensed under: CC-BY-SA with attribution