Question

I am trying to do what should be a simple task: I need to convert a text file to upper case using Hadoop streaming with Python.

I want to do it by using the TextInputFormat which passes file position keys and text values to the mappers. The problem is that Hadoop streaming automatically discards the file position keys, which are needed to preserve the ordering of the document.

How can I retain the file position information of the input to the mappers? Or is there a better way to convert a document to upper case using Hadoop streaming?

Thank you.


Solution

If your job is just to upper-case a single file, then Hadoop isn't really going to give you anything beyond streaming the file to a single machine, performing the upper-casing, and writing the contents back to HDFS. Even with a huge file (say 1 TB), you would still need to funnel everything through a single reducer so that, when it is written back to HDFS, it is stored in a single contiguous file.

In this case I would configure your streaming job to have a single mapper per file (set the minimum and maximum split sizes to something huge, larger than the file itself) and run a map-only job.
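A minimal sketch of such a mapper, assuming Python 3 (the script name is hypothetical; since the job has one mapper per file and no reducers, streaming's loss of the byte-offset keys doesn't matter, as line order is preserved):

```python
#!/usr/bin/env python3
# upper_mapper.py -- hypothetical mapper for a map-only streaming job.
# Hadoop streaming feeds input lines on stdin; we emit each line upper-cased.
import sys


def upper_case_stream(stdin=sys.stdin, stdout=sys.stdout):
    """Read lines from stdin and write them back upper-cased, in order."""
    for line in stdin:
        stdout.write(line.upper())


if __name__ == "__main__":
    upper_case_stream()
```

You might launch it with something like `-D mapred.reduce.tasks=0 -D mapred.min.split.size=<huge> -D mapred.max.split.size=<huge> -mapper upper_mapper.py` (on newer Hadoop versions the corresponding properties are `mapreduce.job.reduces` and `mapreduce.input.fileinputformat.split.minsize`/`.maxsize`); check the property names against your Hadoop version.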

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow