Question

I know this is kind of a broad question, but I have scoured both this forum and the internet in general to no avail for this particular situation. Imagine I have a trained model for which, even though the data might not initially have been complete and clean, I took steps to make the data compliant with the model's requirements (no outliers where appropriate, de-skewed if necessary, normalized if necessary, null values imputed appropriately). This is done in a cross-validation framework. All of this works absolutely fine when tuning the model, but I run into problems when I try to make a single prediction with it (meaning I have a single "test" record; think of a web service with some fields that can be null). Null values generally need a reference dataset for filling, as do the normalization and outlier procedures.

Initially I thought about attaching such a "test" record to a portion of the "train" dataset so that this problem would not arise, but then other issues appear: how would I choose that portion? If I used the most recent data, would I bias the result somehow? And using the whole dataset is impractical, as well as potentially unfeasible, when dealing with "big" data.

Do you happen to know whether there are best practices on this topic, or could you point me to the themes/keywords that deal with these issues?

P.S.: regarding the relevance of the problem, the null values will most likely stay (I have no way of forcing users to fill them in beforehand in the web application without hurting the user experience).


Solution

You need to save the fitted preprocessing steps (the instructions and parameters for performing them), not necessarily the dataset that you extracted them from.

See:
- Obtaining consistent one-hot encoding of train / production data
- Binary Classification - One Hot Encoding preventing me using Test Set
- one-hot-encoding categorical data gives error

In particular, sklearn preprocessors can be pickled and then used via their transform method in production, if you can use sklearn in deployment. PMML can also translate most transformers. Or you can write your own simple transformer.
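
A minimal sketch of that idea, assuming scikit-learn and joblib are available in both environments (the column names, file name, and toy values here are made up for illustration): fit the preprocessors once on the training data, persist them, then reuse their transform on a single incoming record without any training data present.

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# --- offline, at training time ---
train = pd.DataFrame({
    "age": [25, 32, None, 41],
    "income": [30_000, None, 52_000, 61_000],
})

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill values learned from train
    ("scale", StandardScaler()),                   # normalization learned from train
])
preprocess.fit(train)
joblib.dump(preprocess, "preprocess.joblib")       # persist the fitted transformers

# --- online, in the web service ---
preprocess = joblib.load("preprocess.joblib")
record = pd.DataFrame({"age": [np.nan], "income": [45_000]})  # one request, with a null
features = preprocess.transform(record)            # no training data needed here
```

The key point is that fit is only ever called offline; the web service only loads the artifact and calls transform, so the single record is imputed and scaled with the statistics learned from the training set.
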

As for using newer data to refit the transformers: that is getting closer to retraining, and in most settings I would keep it in the same place as refitting the model, either both offline or both online.
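
One way to keep the two in lockstep (a sketch, again assuming scikit-learn, with made-up data) is to bundle the transformers and the estimator in a single Pipeline, so any refit on newer data updates the imputation, the scaling, and the model weights together, and they ship as one artifact:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

X_train = np.array([[25., 30_000.], [32., np.nan], [np.nan, 52_000.], [41., 61_000.]])
y_train = np.array([0, 1, 0, 1])
model.fit(X_train, y_train)  # refitting re-learns preprocessing and weights together

model.predict(np.array([[np.nan, 45_000.]]))  # single record; nulls handled inside
```
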

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange