Question

Hi, I have some query log files of the following form:

    q_string    q_visits    q_date
0   red ballons 1790        2012-10-02 00:00:00
1   blue socks  364         2012-10-02 00:00:00
2   current     280         2012-10-02 00:00:00
3   molecular   259         2012-10-02 00:00:00
4   red table   201         2012-10-02 00:00:00

I have one file per day, covering every month of a year. What I would like to do is:

(1) Group the files by month (or more specifically group all of the q_strings belonging to each month)

(2) Since the same q_string may appear on multiple days, I would like to group the same q_strings within the month, summing on q_visits across all the instances of that q_string

(3) Normalise the q_visits for each grouped q_string (by dividing the sum of q_visits for the grouped q_string by the sum of q_visits across all q_strings within the month)
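
To make step (3) precise, this is the quantity I have in mind. The symbols are my own notation for illustration: v(q, d) is the q_visits value for q_string q on day d, and M is the set of days in one month.

```latex
\[
\mathrm{norm}(q, M) \;=\;
\frac{\sum_{d \in M} v(q, d)}
     {\sum_{q'} \sum_{d \in M} v(q', d)}
\]
```

As a check on the arithmetic: if a month consisted only of the five rows shown above, "red ballons" would get 1790 / (1790 + 364 + 280 + 259 + 201) = 1790 / 2894 ≈ 0.6185.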

I expect the output to have a similar schema to the input, except with an extra column holding the normalised monthly q_visits volumes.

I have been doing this in Python/Pandas, but now have more data and feel that the problem lends itself more easily to MapReduce.

Would the above be easy to implement on AWS EMR? Conceptually, what would be the MR workflow for doing the above? I would like to keep coding in Python, so I will probably use streaming.

Thanks in advance for any help.


Solution

I would use Pig instead. It is easy to learn and write, and it avoids lengthy code: you express the processing as a series of transformations (a data flow) and get the desired result. If it fits your needs, it is far more convenient than raw MR jobs. Pig was developed for exactly this kind of task, and it will likely save you a lot of time.
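
For comparison, if you do stay with Python streaming, the data flow could be sketched as two passes: one that sums q_visits per (month, q_string), and one that divides each sum by its month's total. The following is only a local simulation in plain Python, not an actual EMR job. It assumes tab-separated lines without the leading index column, and the function names and the extra sample rows are made up for illustration.

```python
from collections import defaultdict

def map_line(line):
    """Mapper: turn one tab-separated log line into ((month, q_string), visits)."""
    q_string, q_visits, q_date = line.rstrip("\n").split("\t")
    # "2012-10-02 00:00:00" -> "2012-10" is used as the month key.
    return (q_date[:7], q_string), int(q_visits)

def reduce_sum(pairs):
    """Reducer for pass 1: sum visits per (month, q_string) key."""
    totals = defaultdict(int)
    for key, visits in pairs:
        totals[key] += visits
    return dict(totals)

def normalise(totals):
    """Pass 2: divide each summed count by its month's total visits."""
    month_totals = defaultdict(int)
    for (month, _), visits in totals.items():
        month_totals[month] += visits
    return {key: (visits, visits / month_totals[key[0]])
            for key, visits in totals.items()}

# Illustrative input: the first two rows come from the question,
# the last two are invented to show a repeated q_string and a second month.
lines = [
    "red ballons\t1790\t2012-10-02 00:00:00",
    "blue socks\t364\t2012-10-02 00:00:00",
    "red ballons\t210\t2012-10-03 00:00:00",
    "red table\t201\t2012-11-01 00:00:00",
]

result = normalise(reduce_sum(map(map_line, lines)))
for (month, q_string), (visits, share) in sorted(result.items()):
    print(month, q_string, visits, round(share, 4))
```

In a real streaming job, the mapper and reducer would read from stdin and write tab-separated key/value lines to stdout, and the second pass would be a separate job (or a reducer keyed on month only). In Pig, the same two passes would be expressed as grouping and aggregation statements.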

OTHER TIPS

For structured data, it will be much easier to use Pig than raw MapReduce. You can write the same solution in Pig with far less code, which reduces development time.

Licensed under: CC-BY-SA with attribution