Question

I am using the table below to process around 15 GB of IIS logs (.gz compressed) on Amazon EMR (1 medium master instance, 4 large core instances, 2 task instances). It takes around 1 hour just to get the result of this query:

select uri, cs_Cookie as Cookie, count(*) as hits from tmp1 group by cs_Cookie, uri order by hits Desc;

I can see that CPU utilization is at 100% on all the DataNodes every time. Could anyone please suggest how to reduce the query time as well as the CPU utilization?

Table definition:

create external table marData(
  logdate string, time string, computername string, clientip string, uri string,
  qs string, localfile string, status string, referer string, w3status string,
  sc_bytes string, cs_bytes string, w3wpbytes string, cs_username string,
  cs_user_agent string, time_local string, timetakenms string, sc_substatus string,
  s_sitename string, s_ip string, s_port string, RequestsPerSecond string,
  s_proxy string, cs_version string, c_protocol string, cs_method string,
  cs_Host string, EndRequest_UTC string, date_local string, CPU_Utilization string,
  cs_Cookie string, BeginRequest_UTC string)
partitioned by (month string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([0-9-]+) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9-]+ [0-9:.]+) ([^ ]*) ([^ ]*) (\".*\"|[^ ]*) ([0-9-]+ [0-9:.]+)",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s %18$s %19$s %20$s %21$s %22$s %23$s %24$s %25$s %26$s %27$s %28$s %29$s %30$s %31$s %32$s")
location 's3://logdata/Mar';

Solution

How is the memory usage on the nodes during this query?

As @Charles Menguy said in the comments, high CPU usage isn't inherently a bad thing.

You may want to consider using more nodes, or larger ones, to complete the job in less time. It may take some experimentation, but it may even work out cheaper for you. For example, we found that switching to larger nodes (we use m2.xlarge) let our jobs run faster per dollar than our original setup of a greater number of m1.large instances.
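
As a rough, purely illustrative sketch of that approach (the instance types, release label, and IAM role names below are placeholders I've assumed, not values from the original setup), launching a cluster with larger core nodes via boto3 might look something like this:

import boto3

# Hypothetical example: start an EMR cluster that uses a larger instance type
# for the core group. Instance types, counts, the release label, and role names
# are assumptions; substitute whatever your own benchmarking shows is cheapest.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="hive-iislog-analysis",
    ReleaseLabel="emr-5.36.0",  # assumed release; use the one you actually run
    Applications=[{"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.2xlarge",  # larger instance type for the core group
             "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])

Running the same Hive query against a couple of different instance-group configurations and comparing the cost per run is usually the quickest way to find the sweet spot; the cheapest configuration is often not the smallest one.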

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow