Obtaining consistent one-hot encoding of train / production data

https://datascience.stackexchange.com/questions/54052

python
pandas
dummy-variables

02-11-2019

Question

I'm building an app that will require user input. On the training set, I currently run the following code, where data is a pandas DataFrame with a mix of categorical and numerical columns.

import pandas as pd
# get_dummies is a top-level pandas function, not a DataFrame method
dummified_data = pd.get_dummies(data)
train_data = dummified_data[:10000]
test_data = dummified_data[10000:12000]

I also have a hand-written function that takes user-inputted data and transforms it into the same format as the dummified data. This doesn't seem sustainable as the number of columns and the number of categories in my categorical variables grow.

Is there a way to dummify training data and production data consistently?

No correct solution
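
Although no answer was accepted, one common way to approach the question can be sketched as follows: dummify the training data once, remember the resulting column layout, and reindex any later input to that layout. This is only a minimal sketch under assumptions, not a definitive solution. The example data, the column names (color, size) and the helper dummify_like_train are hypothetical and do not come from the question.

```python
import pandas as pd

# Hypothetical training data: one categorical and one numerical column.
train = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1.0, 2.0, 3.0]})

# "Fit" step: dummify once and remember the resulting column layout.
train_dummies = pd.get_dummies(train, dtype=int)
train_columns = train_dummies.columns


def dummify_like_train(raw: pd.DataFrame) -> pd.DataFrame:
    """Dummify `raw` and align it to the training columns (hypothetical helper)."""
    # Training categories missing from `raw` become all-zero columns;
    # categories never seen in training are dropped by the reindex.
    dummies = pd.get_dummies(raw, dtype=int)
    return dummies.reindex(columns=train_columns, fill_value=0)


# Hypothetical single row of user input.
user_input = pd.DataFrame({"color": ["green"], "size": [2.5]})
prod = dummify_like_train(user_input)
```

In a real app the stored column list would need to be saved alongside the model (for example by pickling it) so that production code uses exactly the layout the model was trained on. If scikit-learn is available, OneHotEncoder with handle_unknown="ignore" offers a similar fit-once, transform-later workflow.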

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange