Question

I'm trying to understand how all the "big data" components play together in a real-world use case, e.g. Hadoop, MongoDB/NoSQL, Storm, Kafka, ... I know that this is quite a wide range of tools used for different purposes, but I'd like to get to know more about how they interact in applications, e.g. machine learning for an app, webapp, or online shop.

I have visitor/session data, transaction data, etc. and store that; but if I want to make recommendations on the fly, I can't run slow map/reduce jobs for that on some big database of logs I have. Where can I learn more about the infrastructure aspects? I think I can use most of the tools on their own, but plugging them into each other seems to be an art of its own.

Are there any public examples/use cases etc available? I understand that the individual pipelines strongly depend on the use case and the user, but just examples will probably be very useful to me.


Solution

In order to understand the variety of ways machine learning can be integrated into production applications, I think it is useful to look at open source projects and papers/blog posts from companies describing their infrastructure.

The common theme across these systems is the separation of model training from model application. In production systems, model application needs to be fast, on the order of hundreds of milliseconds, but there is more freedom in how frequently fitted model parameters (or their equivalent) need to be updated.
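To make that split concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the function names `train_item_neighbors` and `recommend` are hypothetical, the in-memory dict stands in for whatever key-value store would serve the model in production, and the co-occurrence counting stands in for whatever batch job would do the training. The slow part scans all the session logs once; the fast part is a single lookup.

```python
from collections import Counter, defaultdict
from itertools import permutations


def train_item_neighbors(sessions, top_k=3):
    """Offline step (hypothetical): count item co-occurrence across sessions.

    In a real pipeline this would be a periodic batch job over the logs;
    here it is a plain loop over a small list.
    """
    co_counts = defaultdict(Counter)
    for items in sessions:
        # Every ordered pair of distinct items seen in the same session.
        for a, b in permutations(set(items), 2):
            co_counts[a][b] += 1
    # Keep only the top_k neighbours per item so the serving table stays small.
    return {
        item: [other for other, _ in counts.most_common(top_k)]
        for item, counts in co_counts.items()
    }


def recommend(model, item, default=()):
    """Online step (hypothetical): one dictionary lookup, no scan over logs."""
    return model.get(item, list(default))


# Toy session data, invented for illustration only.
sessions = [
    ["shoes", "socks"],
    ["shoes", "socks", "laces"],
    ["shoes", "laces"],
    ["hat"],
]

# Run rarely (e.g. nightly); the result is what gets shipped to the serving side.
model = train_item_neighbors(sessions)

# Run on every request.
print(recommend(model, "socks"))
```

The design point is that `recommend` never touches the raw sessions, so its latency does not grow with the size of the logs; only the retraining schedule decides how fresh the recommendations are.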

People use a wide range of solutions for model training and deployment.

Other tips

One of the most detailed and clear explanations of setting up a complex analytics pipeline is from the folks over at Twitch.
They give detailed motivations for each of their architecture choices for collecting, transporting, coordinating, processing, storing, and querying their data.
Compelling reading! Find it here and here.

Airbnb and Etsy both recently posted detailed information about their workflows.

Chapter 1 of Practical Data Science with R (http://www.manning.com/zumel/) has a great breakdown of the data science process, including team roles and how they relate to specific tasks. The rest of the book follows the model laid out in that chapter, noting for each task which stage it belongs to and which role would perform it.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange