Question

How can I one-hot encode an unknown dataset by iterating over its columns, checking each column's dtype, and deciding whether to encode it based on its number of unique values? Also, how can I keep track of which one-hot encoded columns correspond to which columns of the original dataset?

Solution

I would recommend using the OneHotEncoder from the category_encoders package and selecting the columns you want to encode with pandas' select_dtypes.

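import pandas as pd
from category_encoders.one_hot import OneHotEncoder

pd.options.display.float_format = '{:.2f}'.format  # make floats legible when printed

# make some data: one string column, one boolean column, one float column
df = pd.DataFrame({'a': ['aa', 'bb', 'cc'] * 2,
                   'b': [True, False] * 3,
                   'c': [1.0, 2.0] * 3})

# pick only the object (string) columns; 'b' and 'c' are left untouched
cols_encoding = df.select_dtypes(include='object').columns.tolist()

# use_cat_names=True names the new columns after the category values,
# which makes it easier to trace them back to the original column
ohe = OneHotEncoder(cols=cols_encoding, use_cat_names=True)
encoded = ohe.fit_transform(df)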
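
The question also asks about choosing columns by their number of unique values and about tracing encoded columns back to the originals. The following is a minimal sketch of one way to do that. It uses pandas.get_dummies rather than category_encoders, and the function name, the helper is_categorical_like, and the max_unique threshold of 10 are all assumptions made for illustration, not part of either library.

```python
import pandas as pd


def is_categorical_like(s):
    """Hypothetical helper: True for object, string, or categorical columns."""
    return (pd.api.types.is_object_dtype(s)
            or pd.api.types.is_string_dtype(s)
            or isinstance(s.dtype, pd.CategoricalDtype))


def one_hot_low_cardinality(df, max_unique=10):
    """One-hot encode categorical-like columns with at most `max_unique`
    distinct values. Returns the encoded frame and a dict that maps each
    original column name to the list of columns created from it."""
    parts = []
    mapping = {}
    for col in df.columns:
        s = df[col]
        if is_categorical_like(s) and s.nunique() <= max_unique:
            dummies = pd.get_dummies(s, prefix=col, dtype=int)
            mapping[col] = list(dummies.columns)  # remember where they came from
            parts.append(dummies)
        else:
            parts.append(s)  # numeric, boolean, or high-cardinality: keep as-is
    return pd.concat(parts, axis=1), mapping


df = pd.DataFrame({'a': ['aa', 'bb', 'cc'] * 2,
                   'b': [True, False] * 3,
                   'c': [1.0, 2.0] * 3})

encoded, mapping = one_hot_low_cardinality(df)
print(mapping)
print(encoded.head())
```

Because the row index is preserved, the encoded columns for any original column can be joined back to the original frame, for example with df.join(encoded[mapping['a']]).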

Note that you can change how categories that were not seen during fitting are handled at transform time with the handle_unknown parameter:

handle_unknown: str

The options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if ‘indicator’ is used, an extra column will be added if the data being transformed contains unknown categories. This can cause unexpected changes in dimension in some cases.
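
To make the dimension issue concrete, here is a pandas-only sketch, not the category_encoders implementation. It assumes, as I understand the default ‘value’ behaviour, that an unseen category ends up with zeros in every indicator column, and it reproduces that outcome by aligning new data to the columns learned from the training data. The frames train and new are made up for the example.

```python
import pandas as pd

train = pd.DataFrame({'a': ['aa', 'bb', 'cc']})
new = pd.DataFrame({'a': ['aa', 'dd']})  # 'dd' never appears in train

# "Fit": remember the columns produced from the training data
train_enc = pd.get_dummies(train, columns=['a'], dtype=int)
fitted_columns = train_enc.columns

# "Transform": encode the new data, then align it to the fitted columns.
# Columns missing from the new data are filled with 0, and the column
# for the unseen category 'dd' is dropped, so the shape stays stable.
new_enc = (pd.get_dummies(new, columns=['a'], dtype=int)
             .reindex(columns=fitted_columns, fill_value=0))

print(new_enc)
```

Without the reindex step, the new data would produce a different set of columns than the training data, which is the kind of dimension change the warning above refers to.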

Licensed under: CC-BY-SA with attribution