Question

I'm using Pig Latin for log processing because of its expressiveness, on a problem where the data isn't big enough to justify setting up a whole Hadoop cluster. I'm running Pig in local mode, but it doesn't seem to be using all the cores available (16 at the moment): monitoring shows CPU usage peaking at about 200%.

Is there a tutorial or any recommendations for fine-tuning Pig for local execution? I'm sure the mappers could use all the available cores with some simple tweaking. (In my script I have already set the default_parallel parameter to 20.)

Best regards.


Solution

Pig's documentation makes it clear that local mode is intended to run single-threaded, taking different code paths for certain operations that would otherwise use a distributed sort. As a result, tuning Pig's local mode is the wrong approach to the problem you describe.

Have you considered running a local, "pseudo-distributed" cluster instead of investing in a full cluster setup? You can follow Hadoop's instructions for pseudo-distributed operation and then point Pig at localhost. This gives the desired result, at the cost of a two-step startup and teardown.
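As a rough sketch of that workflow (assuming a Hadoop 1.x install at $HADOOP_HOME configured for pseudo-distributed operation per the official guide, and a hypothetical script name; the exact commands vary by Hadoop version):

```shell
# One-time: format the local HDFS namespace
$HADOOP_HOME/bin/hadoop namenode -format

# Startup: launch NameNode, DataNode, JobTracker, and TaskTracker on localhost
$HADOOP_HOME/bin/start-all.sh

# Run Pig against the pseudo-distributed cluster instead of local mode
pig -x mapreduce myscript.pig   # 'myscript.pig' is a placeholder

# Teardown when finished
$HADOOP_HOME/bin/stop-all.sh
```

The two-step startup/teardown mentioned above is exactly the start-all.sh / stop-all.sh pair.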

You'll also want to raise the maximum number of map and reduce task slots so that all the cores on your machine are used. Fortunately, this is reasonably well documented (admittedly, in the cluster setup documentation): simply define mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum in your local copy of $HADOOP_HOME/conf/mapred-site.xml.
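For a 16-core machine, the mapred-site.xml fragment might look like this (the slot counts are illustrative; you may want to leave headroom for the daemons themselves):

```xml
<configuration>
  <!-- Max concurrent map tasks per TaskTracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>16</value>
  </property>
  <!-- Max concurrent reduce tasks per TaskTracker -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>16</value>
  </property>
</configuration>
```

Restart the TaskTracker after editing the file so the new slot counts take effect.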

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow