Question

I am using Python and have to work on following scenario using Hadoop Streaming: a) Map1->Reduce1->Map2->Reduce2 b) I dont want to store intermediate files c) I dont want to install packages like Cascading, Yelp, Oozie. I have kept them as last option.

I already went through the same kind of discussion on SO and elsewhere but could not find an answer wrt Python. Can you please suggest.

Was it helpful?

Solution

b) I dont want to store intermediate files

c) I dont want to install packages like Cascading, Yelp, Oozie.

Any reason why? Based on the response, a better solution could be provided.

Intermediates files cannot be avoided, because the o/p of the previous Hadoop job cannot be streamed as i/p to the next job. Create a script like this

run streaming job1
if job1 is not success then exit
run streaming job2
if job2 is success them remove o/p of job1 else exit
run streaming job3
if job3 is succcess them remove o/p of job2 else exit

OTHER TIPS

Why not using MapReduce frameworks for python streaming, like Dumbo https://github.com/klbostee/dumbo/wiki/Short-tutorial, or MRJob http://packages.python.org/mrjob/

For example, with dumbo, your pipe would be:

job.add_iter(Mapper1, Reducer1)
job.add_iter(Mapper2, Reducer2)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top