I would rather use Pig. It is easy to learn and write, with no lengthy pieces of code: you express your data processing as a series of transformations, i.e. a data flow, and get the desired result. If it fits your needs, it is far better than raw MapReduce jobs. Pig was developed for exactly this kind of task, and it will definitely save you a lot of time.
Log file analysis in Hadoop/MapReduce
29-06-2022
Problem
Hi, I have some query log files of the following form:
   q_string      q_visits  q_date
0  red ballons       1790  2012-10-02 00:00:00
1  blue socks         364  2012-10-02 00:00:00
2  current            280  2012-10-02 00:00:00
3  molecular          259  2012-10-02 00:00:00
4  red table          201  2012-10-02 00:00:00
I have one file per day, for each month, over the period of a year. What I would like to do is:
(1) Group the files by month (or, more specifically, group all of the q_strings belonging to each month)
(2) Since the same q_string may appear on multiple days, group the identical q_strings within the month, summing q_visits across all instances of that q_string
(3) Normalise the q_visits for each grouped q_string (by dividing its summed q_visits by the sum of q_visits across all q_strings within the month)
I expect the output to have a similar schema to the input, except with an extra column containing the normalised monthly q_visit volumes.
I have been doing this in Python/Pandas, but now have more data and feel that the problem lends itself more easily to MapReduce.
Would the above be easy to implement in EMR/AWS? Conceptually, what would the MR workflow for doing the above look like? I would like to keep coding in Python, so I will probably use Hadoop Streaming.
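Conceptually, the workflow described above maps onto two passes: a first map/reduce pass that sums q_visits per (month, q_string) key, and a second pass that divides each sum by the month's grand total. Here is a minimal sketch of that logic in Python, written in a Hadoop Streaming style (all function and field names are illustrative, not from the original post; a real streaming job would read sys.stdin in separate mapper and reducer scripts, with Hadoop handling the sort/shuffle between them):

```python
from collections import defaultdict

def map_phase(lines):
    """Pass 1 mapper: key each record by (month, q_string), value = q_visits.
    Assumes tab-separated rows of the form: q_string \t q_visits \t q_date."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip headers / malformed rows
        q_string, q_visits, q_date = parts
        month = q_date[:7]  # '2012-10-02 00:00:00' -> '2012-10'
        yield f"{month}\t{q_string}\t{q_visits}"

def reduce_phase(pairs):
    """Pass 1 reducer: sum q_visits per (month, q_string) key."""
    totals = defaultdict(int)
    for line in pairs:
        month, q_string, q_visits = line.split("\t")
        totals[(month, q_string)] += int(q_visits)
    for (month, q_string), visits in sorted(totals.items()):
        yield f"{month}\t{q_string}\t{visits}"

def normalise(summed):
    """Pass 2: divide each monthly sum by that month's grand total,
    appending the normalised share as an extra column."""
    rows = [line.split("\t") for line in summed]
    month_totals = defaultdict(int)
    for month, _, visits in rows:
        month_totals[month] += int(visits)
    for month, q_string, visits in rows:
        share = int(visits) / month_totals[month]
        yield f"{month}\t{q_string}\t{visits}\t{share:.4f}"
```

On EMR you would ship the two phases as separate mapper/reducer scripts for two chained streaming jobs; the in-memory dictionaries above stand in for Hadoop's shuffle-and-sort and are only suitable for a local sanity check.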
Thanks in advance for any help.
Solution
Other tips
For structured data, it is much easier to use Pig rather than raw MapReduce. You can write the same solution in Pig with far less code, which reduces development time.
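To illustrate how compact this is, a Pig Latin sketch of the same pipeline might look like the following (the input path, field names, and output path are placeholders, not from the original post):

```
-- Load tab-separated query logs; path and schema are illustrative.
logs   = LOAD 'query_logs/*' AS (q_string:chararray, q_visits:long, q_date:chararray);
-- Derive the month key, e.g. '2012-10'.
keyed  = FOREACH logs GENERATE SUBSTRING(q_date, 0, 7) AS month, q_string, q_visits;
-- (1)+(2) Sum q_visits per (month, q_string).
by_q   = GROUP keyed BY (month, q_string);
summed = FOREACH by_q GENERATE FLATTEN(group) AS (month, q_string),
                               SUM(keyed.q_visits) AS visits;
-- (3) Compute each month's grand total and normalise against it.
by_m   = GROUP summed BY month;
m_tot  = FOREACH by_m GENERATE group AS month, SUM(summed.visits) AS total;
joined = JOIN summed BY month, m_tot BY month;
result = FOREACH joined GENERATE summed::month, summed::q_string, summed::visits,
                                 (double)summed::visits / m_tot::total AS share;
STORE result INTO 'normalised_monthly';
```

This is only a sketch: it assumes PigStorage's default tab delimiter matches the log format, and the GROUP/JOIN steps compile down to the same two MapReduce passes you would otherwise write by hand.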