Question

My dataset consists of 3000 rows and 50 columns. One column (ESTIMATE_FAMILY_CONTRIBUTION) contains numerical values (around 2000 distinct values such as 20, 30, 32, ...) but also has one string value, e.g. 'No_information'.

When I one-hot encode the feature with pd.get_dummies(), around 2000 new columns are created for the original column, ESTIMATE_FAMILY_CONTRIBUTION.
What I want is only 2 columns: one for 'No_information' and the other for all the numerical values. How do I do it?

Solution

You have about 2000 distinct values in a dataset of 3000 rows, so I don't think you should treat this as a categorical column.

Treat "No_information" as NaN and impute it with the best-suited strategy, using its relation to the other columns.
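
A minimal sketch of that idea in pandas, on made-up toy data. The median fill is only a placeholder assumption for "best-suited strategy", and the extra `_missing` flag column is an optional addition that keeps a record of where the value was absent (which also gives the two columns asked for in the question):

```python
import pandas as pd

# Toy data standing in for the real column (values are made up).
col = "ESTIMATE_FAMILY_CONTRIBUTION"
df = pd.DataFrame({col: ["20", "30", "No_information", "32", "No_information"]})

# Anything that cannot be parsed as a number (here "No_information") becomes NaN.
numeric = pd.to_numeric(df[col], errors="coerce")

# Optional 0/1 flag recording where the information was missing.
df[col + "_missing"] = numeric.isna().astype(int)

# Placeholder imputation (assumption): median of the observed values.
# Swap in a strategy that uses the other columns if one fits better.
df[col] = numeric.fillna(numeric.median())

print(df)
```

After this step the column is purely numeric and can be used as a continuous feature without any one-hot encoding.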

Edit after comment

If you want to treat it as categorical, I would suggest:
- Create "bins" of values, e.g. <50 - Very Low, 50-500 - Medium, etc.
- Then one-hot encode the binned data.
- With this approach too, "No_information" should be treated as NaN. The reason is that when the information is not available, the true value could have been anything, i.e. Low, Medium or High. Treating it as a fourth category carries less information.
- Try different bins/approaches and see which produces the best result.
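
The binning steps above could be sketched as follows. The bin edges, the labels and the `EFC` prefix are illustrative assumptions, not values from the question; the column is assumed to be numeric already, with "No_information" converted to NaN:

```python
import numpy as np
import pandas as pd

# Toy numeric column; NaN stands for the former "No_information" rows.
values = pd.Series([20, 30, np.nan, 320, 700, 45], name="ESTIMATE_FAMILY_CONTRIBUTION")

# Illustrative bins (assumption): <50, 50 to <500, 500 and above.
bins = [-np.inf, 50, 500, np.inf]
labels = ["very_low", "medium", "high"]

# right=False makes each bin include its left edge, so 50 falls in "medium".
binned = pd.cut(values, bins=bins, labels=labels, right=False)

# One dummy column per bin; NaN rows get no bin and stay all-zero by default.
dummies = pd.get_dummies(binned, prefix="EFC")

print(dummies)
```

If the NaN values are imputed before binning, every row ends up in exactly one bin; otherwise the missing rows are simply left with no bin set.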

Dealing with high cardinality in a categorical feature:
- Search the internet or Stack Exchange for "Encode feature with very high cardinality". This is a known challenge, so you will find plenty of resources.
- Try other encoding approaches. See these links:
Beyond OHE
Library
kaggle post
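
As one hedged example of an alternative to one-hot encoding, frequency encoding replaces each category with its relative frequency, so a high-cardinality feature stays a single column. The answer does not say which approaches the links cover, so this choice and the toy data are assumptions:

```python
import pandas as pd

# Toy categorical column (made-up values).
s = pd.Series(["a", "b", "a", "c", "a", "b"], name="category")

# Share of rows taken by each category.
freq = s.value_counts(normalize=True)

# Map every row to the frequency of its category: one numeric column.
encoded = s.map(freq)

print(encoded)
```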

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange