Question

While working on data sets, we sometimes want to keep track of multiple models with different architectures that all operate on the same data set, to which the same transformations and preprocessing have been applied.

So I would like to know: what is an elegant way to work on multiple models that use the same data set? Keeping multiple models in the same notebook is cumbersome, and recreating the same data preprocessing and transformations in separate notebooks means a lot of copy-paste, which I think can be solved with some existing solution I am not aware of. What is the industry standard for such tasks?

Any help is appreciated.


Solution

If I were in your situation, I would approach it this way:

  • Create a "preprocessing"/"data" module

That module can either be a simple data access layer shared across notebooks, or it can additionally include the preprocessing steps. This enforces a common data access layer across notebooks without duplicating code.

So you could do something like this:

from data_layer import Data, Preprocessor

preprocessor = Preprocessor(**kwargs)
data = Data(preprocessor=preprocessor, **kwargs)
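
To make that concrete, here is a minimal sketch of what such a data_layer module could look like. Only the Data and Preprocessor names come from the snippet above; the constructor arguments and preprocessing steps are assumptions for illustration:

# data_layer.py -- a minimal sketch; the dropna/normalize options
# and the CSV source are assumptions, not part of the original answer.
import pandas as pd

class Preprocessor:
    def __init__(self, dropna=True, normalize=False):
        self.dropna = dropna
        self.normalize = normalize

    def transform(self, df):
        # All shared preprocessing lives in one place.
        if self.dropna:
            df = df.dropna()
        if self.normalize:
            num = df.select_dtypes("number")
            df[num.columns] = (num - num.mean()) / num.std()
        return df

class Data:
    def __init__(self, preprocessor, path="dataset.csv"):
        self.preprocessor = preprocessor
        self.path = path

    def load(self):
        # Every notebook gets the same data through the same pipeline.
        return self.preprocessor.transform(pd.read_csv(self.path))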

In an ideal world, you should be able to define your data access layer as data itself: for instance, a JSON document that specifies the data source and the preprocessing options.
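
As a sketch of that idea, reusing the hypothetical Data and Preprocessor from above (the config schema here is invented):

# experiment.json might look like:
# {"source": "dataset.csv", "preprocessing": {"dropna": true, "normalize": true}}
import json

def data_from_config(path):
    # Build the whole data access layer from a declarative config file,
    # so notebooks only differ in which config they point at.
    with open(path) as f:
        cfg = json.load(f)
    preprocessor = Preprocessor(**cfg["preprocessing"])
    return Data(preprocessor=preprocessor, path=cfg["source"])

data = data_from_config("experiment.json")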

  • Create a different notebook per model

And in this case, if your models share components, I'd also create modules they can all import from, the same way you'd use sklearn or torch components. For example, a hypothetical models/common.py could hold the shared building blocks (the torch-based encoder below is just an illustration):
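
# models/common.py -- hypothetical shared components module.
import torch.nn as nn

def make_encoder(in_dim, hidden_dim):
    # A building block every model notebook can import
    # instead of redefining it by copy-paste.
    return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())

# In each model notebook:
# from models.common import make_encoder
# encoder = make_encoder(in_dim=32, hidden_dim=64)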

Licensed under: CC-BY-SA with attribution