Question

I have a log file in HDFS and need to process it to work out things like the maximum hits per page, per URL, and so on. Sample data is given below:

10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 202
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 209
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET / HTTP/1.1" 200 9157
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/js/lowpro.js HTTP/1.1" 200 10469
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/css/reset.css HTTP/1.1" 200 1014
10.223.157.186 - - [15/Jul/2009:15:50:35 -0700] "GET /assets/css/960.css HTTP/1.1" 200 6206 
...............................

How can I process this data? Is it possible to load it into Hive? If yes, then how? Or is MapReduce a better option? Please guide.


Solution

Hive is a good option.

You have two clear ways to do it in Hive: either 1) write a custom SerDe to parse the log lines, or 2) write a Hive view that uses Hive's built-in string UDFs (mostly regexp_extract) to pull out the components.
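As a sketch of option 2, assuming the raw lines are first exposed as a one-column external table (the table, view, and column names here are illustrative, and the regexes would need checking against your full log), a view can split each field out with regexp_extract:

```sql
-- Raw log lines exposed as whole strings; names are illustrative.
CREATE EXTERNAL TABLE raw_logs (line STRING)
LOCATION '/path/to/logs';  -- HDFS directory holding the log file

-- Parse each Common Log Format line with regexp_extract.
CREATE VIEW parsed_logs AS
SELECT
  regexp_extract(line, '^([^ ]*)', 1)                  AS host,
  regexp_extract(line, '\\[([^\\]]*)\\]', 1)           AS request_time,
  regexp_extract(line, '"[A-Z]+ ([^ ]*)', 1)           AS url,
  CAST(regexp_extract(line, '" ([0-9]*) ', 1) AS INT)  AS status,
  CAST(regexp_extract(line, '([0-9]*)$', 1) AS INT)    AS bytes
FROM raw_logs;

-- Example: most-requested URLs.
SELECT url, COUNT(*) AS hits
FROM parsed_logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

Because the parsing lives in the view, you can adjust the regexes without reloading any data.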

Writing the SerDe will probably be more efficient and is overall the better route, but it is slightly trickier code to write.
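Note that you may not need to write a SerDe from scratch: Hive ships a RegexSerDe, and the Hive documentation shows it applied to exactly this Apache log format. A sketch along those lines (location and column names are placeholders):

```sql
-- One column per log field; the bundled RegexSerDe parses lines at read time.
CREATE EXTERNAL TABLE apache_log (
  host      STRING,
  identity  STRING,
  auth_user STRING,
  log_time  STRING,
  request   STRING,
  status    STRING,
  size      STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)"
)
STORED AS TEXTFILE
LOCATION '/path/to/logs';
```

With this table in place, the same GROUP BY / ORDER BY queries as above run directly against the typed columns, and no separate parsing view is needed.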

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow