In machine learning we have modules that perform operations on data in a sequential manner. The modules are generally the following:

  1. Data Collection Module: takes raw data from a source (filesystem, database, ...) and brings it into the pipeline. We can assume that the result of this module is an object of type "Dataset".

  2. Data Cleaning Module: takes the Dataset object as input, checks the data for errors (mainly missing values) and outputs a new Dataset object with clean data.

  3. Data Preprocessing Module: takes the new Dataset object, applies mathematical operations to the data it wraps (normalization, standardization, ...) and outputs a new Dataset object in a new format.

  4. Training Module: responsible for training machine learning models using different algorithms; it can use one algorithm, or multiple algorithms in multiple stages, to produce different results that can be compared so that the best performing model is selected.

  5. Finally, a Testing Module: takes the trained and selected model and verifies that it has low error rates by feeding it a sample of test data.

As you can see, the pipeline we want to develop comprises a set of objects, each of which performs a set of operations on our input data and passes the output to the next object in the sequence. So our data is forced to follow the path we specify by configuring our pipeline, with some variable objects that can be swapped out at any link in the chain.

The pipeline can be represented as follows:

raw_data ==> data_collection > data_cleaning > data_preprocessing > model_training > model_testing ==> model

These modules share some common operations such as execute() and validate(). Imagine a single abstract class (let's call it IMLOperation) that acts as an interface holding these common operations, from which all the operations in the pipeline are derived (sub-classed as Preprocess, Collect, ... objects), as sketched below. Do you think that this approach, together with the Iterator design pattern to enforce the order, is suitable for developing this pipeline? Or is the Strategy pattern, along with a client that provides a stack of ordered operations, a better solution for this pipeline architecture?
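
For illustration, here is a minimal sketch of that idea. The Dataset class and the operation bodies are placeholders rather than real implementations, and a plain list plus a for loop stands in for a full Iterator:

from abc import ABC, abstractmethod

class Dataset:
    """Placeholder wrapper around the data flowing through the pipeline."""
    def __init__(self, records):
        self.records = records

class IMLOperation(ABC):
    """Common interface every pipeline step derives from."""

    @abstractmethod
    def validate(self, dataset: Dataset) -> None:
        ...

    @abstractmethod
    def execute(self, dataset: Dataset) -> Dataset:
        ...

class Collect(IMLOperation):
    def validate(self, dataset: Dataset) -> None:
        pass  # e.g. check that the data source is reachable

    def execute(self, dataset: Dataset) -> Dataset:
        return dataset  # e.g. load raw records from a file or database

class Clean(IMLOperation):
    def validate(self, dataset: Dataset) -> None:
        pass  # e.g. check for missing values

    def execute(self, dataset: Dataset) -> Dataset:
        return Dataset([r for r in dataset.records if r is not None])

# The pipeline is an ordered collection of operations;
# iterating over it enforces the execution order.
pipeline = [Collect(), Clean()]

data = Dataset([1, None, 2, 3])
for operation in pipeline:
    operation.validate(data)
    data = operation.execute(data)

print(data.records)  # [1, 2, 3]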


Solution

As with any data processing or sequential operations, look at functional programming patterns as a cleaner alternative to design patterns.

Design patterns wise:

  • strategy pattern: helps when you have to pick one algorithm from a list of algorithms to apply as a transformation. Here each of your modules can be considered an algorithm to apply (see the sketch after this list)
  • iterator pattern: here it would be a glorified foreach loop over the individual data records (the iteration unit)
  • map (and reduce) pattern? Your operations here are a set of maps (as in applying a method in a chain: https://en.wikipedia.org/wiki/Map_(higher-order_function)). When I have multiple transformation methods I tend to apply maps, as that makes the code more modular and lets me test each part in isolation. Most modern programming languages are optimised for using maps to transform data, and the clarity gain outweighs the performance gain from putting everything in one for loop in the long run (or at any time, for that matter, especially in a team)
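
To make the strategy point concrete, here is a minimal sketch. The normalize/standardize functions and the Preprocessor context are hypothetical placeholders, not tied to any particular library:

from typing import Callable, List

# A "strategy" is just a callable that transforms the data.
Strategy = Callable[[list], list]

def normalize(records: list) -> list:
    top = max(records)
    return [r / top for r in records]

def standardize(records: list) -> list:
    mean = sum(records) / len(records)
    return [r - mean for r in records]

class Preprocessor:
    """Context object: the client injects the algorithm it wants applied."""
    def __init__(self, strategy: Strategy):
        self.strategy = strategy

    def execute(self, records: list) -> list:
        return self.strategy(records)

# The client picks one algorithm from the list of available ones.
available: List[Strategy] = [normalize, standardize]
step = Preprocessor(available[0])
print(step.execute([1, 2, 3, 4]))  # [0.25, 0.5, 0.75, 1.0]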

Python implementation wise:

  • there is a core built-in map function that lets you chain operations in a lazy fashion
  • use a streaming framework (rxpy is what we use in production now) if you want to benefit from a great abstraction that also helps with the general project organisation. Rxpy provides the framework, but there are no connectors out of the box afaik. Do use a framework for production work, either homebrewed or open source (a hedged rxpy sketch follows the map example below)

Python map example:

# raw data
a = [1, 2, 3, 4]

# chain operations lazily; the basic lambdas stand in for your own module.run()
b = map(lambda x: x * 2, a)
c = map(lambda x: x - 3, b)

# nothing is computed until we consume the result, since map is lazy
print(list(c))  # [-1, 1, 3, 5]
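
For the streaming-framework route mentioned above, a minimal sketch with rxpy might look roughly like the following. This assumes RxPY 4 (the reactivex package); the lambdas are again placeholders for your own module.run() calls:

import reactivex as rx
from reactivex import operators as ops

# each operator is one pipeline stage; pipe() chains them in order
rx.of(1, 2, 3, 4).pipe(
    ops.map(lambda x: x * 2),  # e.g. a preprocessing step
    ops.map(lambda x: x - 3),  # e.g. another transformation
).subscribe(lambda value: print(value))  # prints -1, 1, 3, 5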