Question

I want to build in-house funnel analysis infrastructure. All user activity feed data would be written to a database or data warehouse of choice; then, when I dynamically define a funnel, I want to be able to select the count of sessions for each stage in the funnel.

I can't find an example of creating such a thing anywhere. Some people say I should use Hadoop and MapReduce for this but I couldn't find any examples online.


Solution

Your MapReduce job is pretty simple:

The mapper reads one session row from the log file and emits (stage-id, 1).

Set the number of reducers equal to the number of stages.

The reducer sums the values for each stage, as in the WordCount example (which is the "Hello World" of Hadoop - https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0).
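To make the three steps concrete, here is a minimal single-process sketch in Python that imitates the map, shuffle and reduce phases. It is not Hadoop code: the row layout (session_id, stage_id), the sample data and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical log rows: (session_id, stage_id). The layout is an assumption.
LOG_ROWS = [
    ("s1", "landing"),
    ("s1", "signup"),
    ("s2", "landing"),
    ("s3", "landing"),
    ("s3", "signup"),
    ("s3", "purchase"),
]

def mapper(row):
    """Emit (stage_id, 1) for one log row."""
    _session_id, stage_id = row
    yield stage_id, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(stage_id, values):
    """Sum the ones emitted for a single stage."""
    return stage_id, sum(values)

def run_job(rows):
    mapped = (pair for row in rows for pair in mapper(row))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())

stage_counts = run_job(LOG_ROWS)
print(stage_counts)
```

Note that this counts rows per stage, so it equals the session count only if each session logs a given stage at most once. Otherwise the rows would need to be de-duplicated by (session_id, stage_id) first.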

You will have to set up a Hadoop cluster (or use Elastic MapReduce on Amazon).

To define the funnel dynamically, you can use Hadoop's DistributedCache feature. To see results you will have to wait for the MapReduce job to finish (at least dozens of seconds, or minutes with Amazon's Elastic MapReduce; the time depends on the amount of data and the size of your cluster).
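One way to picture the dynamic funnel definition is as a small side file that every mapper loads before it processes rows, which is the role DistributedCache would play on a real cluster. The sketch below is plain Python rather than Hadoop code. The JSON layout, the event names and the helper names are hypothetical.

```python
import json
from collections import Counter

# Hypothetical funnel definition. On a cluster this would be a small file
# shipped to every task; here it is an inline string for illustration.
FUNNEL_JSON = """
{"stages": [
  {"id": "landing",  "event": "view_home"},
  {"id": "signup",   "event": "submit_signup"},
  {"id": "purchase", "event": "checkout_done"}
]}
"""

def load_funnel(text):
    """Return a dict mapping event name -> stage id."""
    definition = json.loads(text)
    return {stage["event"]: stage["id"] for stage in definition["stages"]}

def mapper(row, event_to_stage):
    """Emit (stage_id, 1) only for events that belong to the funnel."""
    _session_id, event = row
    stage_id = event_to_stage.get(event)
    if stage_id is not None:
        yield stage_id, 1

# Hypothetical activity rows: (session_id, event_name).
EVENTS = [
    ("s1", "view_home"),
    ("s1", "open_help"),       # not part of the funnel, so it is skipped
    ("s1", "submit_signup"),
    ("s2", "view_home"),
]

event_to_stage = load_funnel(FUNNEL_JSON)
funnel_counts = Counter()
for row in EVENTS:
    for stage_id, one in mapper(row, event_to_stage):
        funnel_counts[stage_id] += one
print(dict(funnel_counts))
```

Changing the funnel then means changing only the definition file, not the job code.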

Another solution that may give you results faster is to use a database: select stage, count(distinct session_id) from mylogs group by stage;
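As a quick way to try the database approach, the following sketch runs the per-stage distinct-session count against an in-memory SQLite database, with the select list and GROUP BY in standard SQL order. The table name mylogs and the session_id and stage columns come from the query above. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mylogs (session_id TEXT, stage TEXT)")
conn.executemany(
    "INSERT INTO mylogs VALUES (?, ?)",
    [
        ("s1", "landing"),
        ("s1", "landing"),   # repeated visit, counted once by DISTINCT
        ("s1", "signup"),
        ("s2", "landing"),
        ("s3", "landing"),
        ("s3", "signup"),
        ("s3", "purchase"),
    ],
)

rows = conn.execute(
    "SELECT stage, COUNT(DISTINCT session_id) FROM mylogs GROUP BY stage"
).fetchall()
sessions_per_stage = dict(rows)
conn.close()
print(sessions_per_stage)
```

Unlike a plain row count, COUNT(DISTINCT session_id) does not inflate a stage when the same session hits it more than once.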

If you have too much data to execute that query quickly (it does a full table scan, and HDD transfer rate is about 50-150 MB/sec, so the math is simple), you can use a distributed analytic database that runs over HDFS (Hadoop's distributed file system).
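The "simple math" can be written out as a back-of-the-envelope calculation. The 1 TB table size below is an assumed figure for illustration only. The 50 and 150 MB/sec rates are the ones quoted above.

```python
def full_scan_seconds(table_bytes, mb_per_sec):
    """Time to read the whole table sequentially from a single disk."""
    return table_bytes / (mb_per_sec * 1_000_000)

ONE_TB = 10**12  # assumed table size, not a figure from the original answer

slow = full_scan_seconds(ONE_TB, 50)    # at 50 MB/sec
fast = full_scan_seconds(ONE_TB, 150)   # at 150 MB/sec

print(f"50 MB/sec:  {slow / 3600:.1f} hours")
print(f"150 MB/sec: {fast / 3600:.1f} hours")
```

On a single disk, a table of that assumed size takes hours to scan, which is why spreading the data over many disks that are read in parallel becomes attractive.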

In this case your options are (I list only open-source projects here):

Apache Hive (based on Hadoop's MapReduce, but if you convert your data to Hive's ORC format you will get results much faster).

Cloudera's Impala - not based on MapReduce; it can return your results in seconds. For the fastest results, convert your data to Parquet format.

Shark/Spark - in-memory distributed database.

Licensed under: CC-BY-SA with attribution