Looking for example infrastructure stacks/workflows/pipelines
16-10-2019
Question
I'm trying to understand how all the "big data" components play together in a real-world use case, e.g. Hadoop, MongoDB/other NoSQL stores, Storm, Kafka, ... I know this is a wide range of tools used for different kinds of tasks, but I'd like to learn more about how they interact in applications, e.g. machine learning for an app, web app, or online shop.
I have visitor/session data, transaction data, etc. and store it; but if I want to make recommendations on the fly, I can't run slow map/reduce jobs over some big database of logs for that. Where can I learn more about the infrastructure aspects? I think I can use most of the tools on their own, but plugging them into each other seems to be an art of its own.
Are there any public examples/use cases etc available? I understand that the individual pipelines strongly depend on the use case and the user, but just examples will probably be very useful to me.
Solution
In order to understand the variety of ways machine learning can be integrated into production applications, I think it is useful to look at open source projects and papers/blog posts from companies describing their infrastructure.
The common theme across these systems is the separation of model training from model application. In production systems, model application needs to be fast, on the order of hundreds of milliseconds, but there is more freedom in how frequently the fitted model parameters (or their equivalent) are updated.
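To make that split concrete, here is a minimal sketch (not any of the specific systems named below) where a slow batch job fits parameters and exports them to a small parameter store, and a separate low-latency serving path only loads and applies them. The file-based "store" and all names are illustrative:

```python
import json
import tempfile

def train_offline(examples):
    """Slow batch job: fit model parameters (here, a trivial 1-D
    least-squares slope through the origin) and export them."""
    num = sum(x * y for x, y in examples)
    den = sum(x * x for x, _ in examples)
    return {"slope": num / den}

def serve_online(params, x):
    """Fast path: apply already-fitted parameters; no training here."""
    return params["slope"] * x

# Batch side: train on historical logs, then publish the parameters.
params = train_offline([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
store = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
json.dump(params, store)
store.close()

# Serving side: load the parameters once, then score requests cheaply.
with open(store.name) as f:
    loaded = json.load(f)
print(serve_online(loaded, 5.0))  # -> 10.0
```

The point is the interface between the two halves: training can take hours and run anywhere, as long as it publishes parameters in a form the serving path can read in milliseconds.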
People use a wide range of solutions for model training and deployment:
Build a model, then export and deploy it with PMML
- Airbnb describes their model training in R/Python and deployment of PMML models via OpenScoring.
- Pattern is a project related to Cascading that can consume PMML and deploy predictive models.
Build a model in MapReduce and access values in a custom system
- Conjecture is an open source project from Etsy that allows for model training with Scalding, an easier-to-use Scala wrapper around MapReduce, and deployment via PHP.
- Kiji is an open source project from WibiData that allows for real-time model scoring (application), as well as functionality for persisting user data and training models on that data via Scalding.
Use an online system that allows for continuously updating model parameters.
- Google released a great paper about an online collaborative filtering approach they implemented to handle recommendations in Google News.
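The "continuously updating parameters" idea can be sketched very simply (this is not Google's algorithm, just the general pattern): each incoming event updates the stored estimate in O(1), so there is never a batch retraining pass between observing data and serving scores. The `OnlineMean` class below is a hypothetical stand-in for real model state:

```python
class OnlineMean:
    """Running mean per item, updated one event at a time."""

    def __init__(self):
        self.count = {}
        self.mean = {}

    def update(self, item, value):
        # Incremental mean: new_mean = old_mean + (value - old_mean) / n,
        # so no history of past events needs to be stored.
        n = self.count.get(item, 0) + 1
        m = self.mean.get(item, 0.0)
        self.count[item] = n
        self.mean[item] = m + (value - m) / n

    def score(self, item):
        # Unseen items fall back to a neutral score of 0.0.
        return self.mean.get(item, 0.0)

model = OnlineMean()
# Stream of (item, clicked) events, applied as they arrive.
for item, clicked in [("a", 1), ("a", 0), ("b", 1), ("a", 1)]:
    model.update(item, clicked)
print(model.score("a"))  # item "a" has seen clicks 1, 0, 1
```

A real online recommender keeps richer state (e.g. co-visitation counts or factor vectors), but the serving-time property is the same: reads are cheap, and writes fold each new event into the parameters immediately.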
Other tips
One of the most detailed and clear explanations of setting up a complex analytics pipeline is from the folks over at Twitch.
They give detailed motivations of each of the architecture choices for collection, transportation, coordination, processing, storage, and querying their data.
Compelling reading! Find it here and here.
Chapter 1 of Practical Data Science with R (http://www.manning.com/zumel/) has a great breakdown of the data science process, including team roles and how they relate to specific tasks. The rest of the book follows the models laid out in that chapter, referencing which stage a given task falls under and which personnel would perform it.