Question

I have some categorical variables in my dataset for a regression problem.

1) One of the variables can take three values (Girls, Boys, Girls&Boys). One-hot or binary encoding will treat all three values as unrelated classes. How can I encode it efficiently while retaining the information that 'Girls&Boys' includes both? Is breaking it into two separate columns, one for girls and one for boys, the only approach in such cases?
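
For reference, the two-column split I have in mind would look roughly like this. It is only a minimal sketch: the helper name `encode_audience` and the flag names are made up for illustration and are not part of my actual dataset.

```python
# Hypothetical helper: turn one three-valued label into two binary flags,
# so that 'Girls&Boys' shares a flag with 'Girls' and a flag with 'Boys'.
def encode_audience(value):
    """Return (has_girls, has_boys) for 'Girls', 'Boys' or 'Girls&Boys'."""
    parts = {p.strip().lower() for p in value.split("&")}
    unknown = parts - {"girls", "boys"}
    if unknown:
        raise ValueError(f"unexpected category: {value!r}")
    return int("girls" in parts), int("boys" in parts)


rows = ["Girls", "Boys", "Girls&Boys"]
encoded = [encode_audience(v) for v in rows]
```

With this layout, 'Girls&Boys' becomes (1, 1) rather than a third, unrelated class.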

2) Age range (18-35, 35-50, etc.): I have broken it down into two columns, age_min and age_max. Is there a better way to use features whose values are ranges?
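
The split I am currently doing is roughly the following (a sketch only; it assumes every label has the form "low-high" with integer bounds, and the function name is hypothetical):

```python
# Hypothetical helper: split an age-range label into numeric bounds.
def split_age_range(label):
    """Return (age_min, age_max) for a label such as '18-35'."""
    low, high = label.split("-")
    return int(low), int(high)


ranges = ["18-35", "35-50"]
age_columns = [split_age_range(r) for r in ranges]
```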

3) Range of percentage: it can take only 5 values (0, 1, 1-5, 5-10, 10). How should I use this variable for training my model? Here I cannot break it down into two columns (as with age) because "1" and "10" are fixed values. How should I treat variables that mix fixed and range values?
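
One option I can think of, since the five values appear to be ordered, is to map each one to its rank. The sketch below assumes that ordering (0 < 1 < 1-5 < 5-10 < 10) is meaningful for the target; the names are hypothetical, and I am not sure this is the right approach.

```python
# Assumed ordering of the five allowed labels, from lowest to highest.
PERCENT_LEVELS = ["0", "1", "1-5", "5-10", "10"]


def encode_percent(label):
    """Return the ordinal rank (0..4) of a percentage label."""
    try:
        return PERCENT_LEVELS.index(label)
    except ValueError:
        raise ValueError(f"unexpected percentage label: {label!r}") from None


ranks = [encode_percent(v) for v in ["0", "1-5", "10"]]
```

This keeps a single column and preserves the order, but it treats the gaps between levels as equal, which may not hold.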

4) Similarity: it can also take 5 values (0, 1, 1-5, 5-10, 10), but it has an additional "auto" option. If it is "auto", the actual value is unknown to us and could be anything. How should I incorporate that? Should I create a separate column for whether it is "auto" or not? Then, in the original similarity column, what value should I put for observations with "auto"? I cannot put "0" because it is already a valid value and I am sure that "auto" won't be 0. Would putting "None" make a difference? How should unknown values be treated?
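
To make the idea concrete, the indicator-column version I am considering would look something like this. It is a sketch under assumptions: the level ordering and all names are hypothetical, and the missing value is stored as NaN, which many estimators would need imputed before training.

```python
import math

# Assumed ordering of the known similarity labels.
SIMILARITY_LEVELS = ["0", "1", "1-5", "5-10", "10"]


def encode_similarity(value):
    """Return (is_auto, level); level is NaN when the value is 'auto'."""
    if value == "auto":
        # Unknown value: flag it and leave the level missing instead of 0.
        return 1, math.nan
    return 0, float(SIMILARITY_LEVELS.index(value))


is_auto, level = encode_similarity("auto")
known = encode_similarity("5-10")
```

The flag lets the model distinguish "auto" rows even after the NaN is filled with some placeholder such as the column median.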

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange