Use MapReduce to split a string and rebuild it

https://stackoverflow.com/questions/22086024

18-10-2022
Question

Hello, I am new to Hadoop and MapReduce programming. I am working with a bunch of Apache logs that we need to analyze to understand access behavior. We are now looking at the request URIs and the referrer URIs. The referrer URIs come with a query string, and I am trying to parse that query string in the Mapper of a MapReduce job. Since I have nothing for a reducer to do, I am not writing a real reducer.

#   ip datetime method uri status code refUri userAgent
79.28.43.25 - - [25/Jan/2009:13:18:02 +0000] "GET /blog/2007/01/internet-explorer-7-in-italiano/ HTTP/1.1" 200 14487 "http://www.google.it/search?hl=it&q=aggiornamento+internet+explorer+&btnG=Cerca+con+Google&meta=&aq=f&oq=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

Now I want to convert this line into:

#   ip datetime method uri status code refUri hl q btnG meta aq oq userAgent
79.28.43.25 - - [25/Jan/2009:13:18:02 +0000] "GET /blog/2007/01/internet-explorer-7-in-italiano/ HTTP/1.1" 200 14487 "http://www.google.it/search?hl=it&q=aggiornamento+internet+explorer+&btnG=Cerca+con+Google&meta=&aq=f&oq=" "it" "aggiornamento+internet+explorer+" "Cerca+con+Google" "" "f" "" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

Is this a good use case for a map-only Hadoop job? We have over 1 PB of logs, and we expect that to grow.

Solution

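Yes. If you only need to map the data, there is no need for a reduce step. Make sure you set the number of reducers to zero (in the Java API, job.setNumReduceTasks(0)) so that the reduce phase is skipped entirely and the mapper output is written straight to the job's output path, with no sort or shuffle.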
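
For the per-line work itself, here is a minimal sketch in plain Java of the transformation shown in the question. It has no Hadoop dependency so it can be tried on its own. The class name RefererFields, the method names, and the regular expression are my own assumptions, not anything from the question. The list of keys is taken from the example header. The values are left undecoded (the + signs stay) because that is what the desired output shows.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: expands the referrer query string into extra quoted fields.
class RefererFields {
    // Query parameters to emit, in output column order (from the example header).
    static final String[] KEYS = {"hl", "q", "btnG", "meta", "aq", "oq"};

    // Assumed layout: ip - - [datetime] "request" status bytes "referrer" "userAgent"
    static final Pattern LINE = Pattern.compile(
        "^(\\S+ \\S+ \\S+ \\[[^\\]]+\\] \"[^\"]*\" \\d{3} \\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    // Splits the query string of a URI into raw (undecoded) key/value pairs.
    static Map<String, String> parseQuery(String uri) {
        Map<String, String> params = new HashMap<>();
        int q = uri.indexOf('?');
        if (q < 0) {
            return params; // no query string at all
        }
        String query = uri.substring(q + 1);
        int hash = query.indexOf('#');
        if (hash >= 0) {
            query = query.substring(0, hash); // drop any fragment
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.putIfAbsent(key, value); // keep the first occurrence of a repeated key
        }
        return params;
    }

    // Returns the line with one quoted field per key inserted between the
    // referrer and the user agent. Lines that do not match are returned unchanged.
    static String expand(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return line;
        }
        Map<String, String> params = parseQuery(m.group(2));
        StringBuilder out = new StringBuilder(m.group(1));
        out.append(" \"").append(m.group(2)).append('"');
        for (String key : KEYS) {
            out.append(" \"").append(params.getOrDefault(key, "")).append('"');
        }
        out.append(" \"").append(m.group(3)).append('"');
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "79.28.43.25 - - [25/Jan/2009:13:18:02 +0000] "
            + "\"GET /blog/2007/01/internet-explorer-7-in-italiano/ HTTP/1.1\" 200 14487 "
            + "\"http://www.google.it/search?hl=it&q=aggiornamento+internet+explorer+"
            + "&btnG=Cerca+con+Google&meta=&aq=f&oq=\" "
            + "\"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)\"";
        System.out.println(expand(line));
    }
}
```

In an actual job you would call something like expand(value.toString()) from your Mapper's map method and write the result out as the only output. Real logs will contain lines that do not fit the pattern (escaped quotes, truncated records), so you will probably want to count or route those separately rather than pass them through silently as this sketch does.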

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow