Question

I have a Pig script that reads all the data from Cassandra, applies a small transformation, and stores the result on HDFS. When I run it from the Grunt shell, it takes nearly 30 minutes because there is a lot of data in Cassandra.

But when I run the same script through an Oozie workflow, it completes but takes much longer, nearly an hour and a half. This is what the Hadoop logs show:

2013-11-19 01:20:00,871 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: master:50030/jobdetails.jsp?jobid=job_201311190052_0002
Heart beat
.. ..
Heart beat
Heart beat
2013-11-19 02:09:59,172 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2013-11-19 02:10:17,289 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete

So it reports "Heart beat" for nearly 50 minutes and only then resumes processing.
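The length of the stall can be checked directly from the two timestamps in the log. Below is a small sketch in Python (not part of the original setup; the helper name `stall_minutes` is made up for illustration) that parses the Pig log timestamp format and computes the gap between the "More information at" line and the "50% complete" line.

```python
from datetime import datetime

# Timestamp layout used by the Pig log lines above: "2013-11-19 01:20:00,871"
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def stall_minutes(start: str, end: str) -> float:
    """Return the number of minutes between two Pig log timestamps."""
    delta = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    return delta.total_seconds() / 60


# Timestamps copied from the log: job submitted, then "50% complete".
gap = stall_minutes("2013-11-19 01:20:00,871", "2013-11-19 02:09:59,172")
print(f"Stalled for about {gap:.1f} minutes")
```

The remaining step from 50% to 100% takes only about 18 seconds, so almost all of the extra time is spent in the first phase of the job.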

I ran telnet from the source machine to the destination on port 9000 and was able to connect. I also checked the IP configuration in /etc/hosts on both machines, and it looks correct.

We still don't understand why this is happening or what it means. How can we overcome it so that the processing finishes more quickly? Any help is highly appreciated.


Solution

Two changes solved the problem:

1) We used a WHERE clause to fetch only the data we needed from Cassandra instead of reading all of it.

2) We added a few more fans to the machine, since the slowdown was also caused by insufficient cooling of the hard disk drives.
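For the first point, the answer does not say which loader was used. As a rough sketch only, assuming the script loads through Cassandra's CqlStorage (whose `cql://` location accepts a URL-encoded `where_clause` parameter), the helper below builds such a location in Python. The keyspace, table, column name, and the function `cql_location` are all hypothetical.

```python
from urllib.parse import quote


def cql_location(keyspace: str, table: str, where_clause: str) -> str:
    """Build a cql:// location with a URL-encoded where_clause (assumed CqlStorage syntax)."""
    encoded = quote(where_clause, safe="")  # encode spaces, '=' and quotes
    return f"cql://{keyspace}/{table}?where_clause={encoded}"


# Hypothetical keyspace, table, and filter column, for illustration only.
location = cql_location("my_keyspace", "events", "event_date = '2013-11-18'")

# The resulting Pig statement reads only the matching rows instead of the whole table.
print(f"rows = LOAD '{location}' USING org.apache.cassandra.hadoop.pig.CqlStorage();")
```

Whether a given filter is accepted depends on the table's schema (for example, whether the column is indexed), so the clause would need to be adapted to the actual data model.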

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow