Obtaining consistent one-hot encoding of train / production data
02-11-2019
Question
I'm building an app that will require user input. Currently, on the training set, I run the following code, in which data is a pandas DataFrame with a mix of categorical and numerical columns.
import pandas as pd
# get_dummies is a top-level pandas function, not a DataFrame method
dummified_data = pd.get_dummies(data)
train_data = dummified_data[:10000]
test_data = dummified_data[10000:12000]
Currently, I have a hand-written function that takes user-inputted data and transforms it into the same format as the dummified data. This doesn't seem sustainable as the number of columns and the number of categories per variable grow.
Is there a way to dummify training data and production data consistently?
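To make the problem concrete, here is a minimal sketch of one common approach: dummify the training data once, remember its column layout, and reindex any later data to that layout. The column names (color, size), the sample values, and the helper dummify_like_training are all hypothetical and only serve as illustration; this is not a definitive solution.

```python
import pandas as pd

# Hypothetical training data: one categorical and one numerical column.
train = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1.0, 2.0, 3.0]})

# Dummify the training data once and remember the resulting column layout.
train_dummies = pd.get_dummies(train)
train_columns = train_dummies.columns


def dummify_like_training(new_data: pd.DataFrame) -> pd.DataFrame:
    """Dummify new_data and align it to the training column layout (hypothetical helper)."""
    dummies = pd.get_dummies(new_data)
    # reindex adds training columns that are missing from new_data (filled with 0)
    # and drops columns for categories that were never seen during training.
    return dummies.reindex(columns=train_columns, fill_value=0)


# Hypothetical user input containing an unseen category ("purple").
user_input = pd.DataFrame({"color": ["green", "purple"], "size": [5.0, 6.0]})
prod_dummies = dummify_like_training(user_input)
```

With this sketch, prod_dummies has exactly the training columns in the same order; the unseen category "purple" simply produces a row with zeros in every color column. If scikit-learn is an option, OneHotEncoder(handle_unknown="ignore") fitted on the training data serves a similar purpose.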
No correct solution
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange