Question

Has anyone tried Apache Giraph on EMR?

It seems to me that the only requirement for running on EMR is to add the proper bootstrap scripts to the Job Flow configuration. After that, I should just need a standard Custom JAR step to launch the Giraph Runner with the appropriate arguments for my Giraph program.

Any documentation or tutorials, or just an account of your own experience with Giraph on EMR, would be much appreciated.

Solution

Yes, I run Giraph jobs on EMR regularly, but I don't use "Job Flows". I log in to the master node manually and use it as a normal Hadoop cluster (I just submit the job with the hadoop jar command).

You are right: you need bootstrap scripts to run ZooKeeper and to add the ZooKeeper details to the core-site config. Here is how I did it:

Bootstrap actions:

Configure Hadoop
    Location: s3://elasticmapreduce/bootstrap-actions/configure-hadoop
    Arguments: --site-key-value, io.file.buffer.size=65536, --core-key-value, giraph.zkList=localhost:2181, --mapred-key-value, mapreduce.job.counters.limit=1200

Run if
    Location: s3://elasticmapreduce/bootstrap-actions/run-if
    Arguments: instance.isMaster=true, s3://hpc-chikitsa/zookeeper_install.sh
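
Once the cluster is up, it can be worth confirming on the master node that configure-hadoop actually wrote giraph.zkList into core-site.xml. The following is only a sketch and not part of the original answer: has_property is a hypothetical helper, and the path /home/hadoop/conf/core-site.xml is an assumption about the AMI's layout, so adjust it for your cluster.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the Hadoop XML config file $1
# contains a property whose <name> is exactly $2.
has_property() {
  local file="$1" name="$2"
  # -F treats the pattern as a fixed string, so dots in the name are literal.
  grep -qF "<name>${name}</name>" "$file"
}

# Assumed location of the Hadoop config on the master node; override if needed.
CORE_SITE="${CORE_SITE:-/home/hadoop/conf/core-site.xml}"

if [ -f "$CORE_SITE" ] && has_property "$CORE_SITE" "giraph.zkList"; then
  echo "giraph.zkList is set in $CORE_SITE"
else
  echo "giraph.zkList not found in $CORE_SITE"
fi
```

If the property is missing, the bootstrap action probably did not run with the arguments you expected.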

The contents of zookeeper_install.sh are:

#!/bin/bash
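# Download the ZooKeeper 3.4.5 release tarball from an Apache mirror
wget --no-check-certificate http://apache.mesi.com.ar/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
# Unpack it and switch into the extracted directory
tar zxvf zookeeper-3.4.5.tar.gz
cd zookeeper-3.4.5
# Use the sample configuration (client port 2181) as the live configuration
mv conf/zoo_sample.cfg conf/zoo.cfg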
sudo bin/zkServer.sh start
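
To confirm that ZooKeeper came up, you could send it the ruok four-letter command on the client port; a healthy server answers imok. This is a sketch rather than part of the original script: zk_is_healthy is a hypothetical helper, and it assumes that nc (netcat) is installed on the master node and that ZooKeeper listens on localhost:2181, matching the giraph.zkList value used in the bootstrap action.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if ZooKeeper at host:port answers "imok".
zk_is_healthy() {
  local host="${1:-localhost}" port="${2:-2181}" reply
  # "ruok" is a ZooKeeper four-letter command; -w 2 gives up after 2 seconds.
  reply=$(echo ruok | nc -w 2 "$host" "$port" 2>/dev/null)
  [ "$reply" = "imok" ]
}

if zk_is_healthy localhost 2181; then
  echo "ZooKeeper is running"
else
  echo "ZooKeeper is not responding on localhost:2181"
fi
```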

Then copy your Giraph jar file to the master node (using scp), ssh to the master node, and submit the job using the hadoop jar command.
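
As a rough illustration of that copy-and-submit step, here is a sketch that prints the commands by default instead of running them. Everything specific in it is an assumption and not from the answer: the key file, master hostname, jar name, and HDFS paths are placeholders, run is a hypothetical helper, and the computation and format class names follow the Giraph quick start example, so they may differ in your Giraph version.

```shell
#!/bin/bash
# Hypothetical helper: prints the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Placeholder values (assumptions); replace them with your own.
KEY="${KEY:-$HOME/.ssh/my-emr-key.pem}"
MASTER="${MASTER:-hadoop@master.example.com}"
JAR="${JAR:-my-giraph-job-with-dependencies.jar}"

# 1. Copy the job jar to the home directory on the master node.
run scp -i "$KEY" "$JAR" "$MASTER:"

# 2. Submit the job on the master node through GiraphRunner.
#    -vif/-vip: vertex input format and path, -vof/-op: output format and path,
#    -w: number of workers.
run ssh -i "$KEY" "$MASTER" \
  hadoop jar "$(basename "$JAR")" org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/hadoop/input/tiny_graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/hadoop/output/shortestpaths \
  -w 1
```

Run it as-is to preview the two commands, then set DRY_RUN=0 once the placeholders are filled in.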

Hope that helps.

Here is a relevant mail thread on the giraph-user mailing list:

https://www.mail-archive.com/user%40giraph.apache.org/msg01240.html

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow