Question

I am trying to run a simple word-counting MapReduce job on Amazon's Elastic MapReduce, but the output is gibberish. The input file is part of the Common Crawl corpus, which consists of Hadoop sequence files. The file is supposed to contain the extracted text (stripped of HTML) from the crawled web pages.

My AWS Elastic MapReduce step looks like this:

Mapper: s3://com.gpanterov.scripts/mapper.py
Reducer: s3://com.gpanterov.scripts/reducer.py
Input S3 location: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
Output S3 location: s3://com.gpanterov.output/job3/

The job runs successfully, but the output is gibberish: nothing but strange symbols, no words at all. I am guessing this is because Hadoop sequence files cannot be read through standard input? If so, how do you run a MapReduce job on such a file? Do the sequence files have to be converted into text files first?

The first couple of lines from part-00000 look like this:

'\x00\x00\x87\xa0 was found 1 times\t\n'
'\x00\x00\x8e\x01:\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x05\xc1=K\x02\x01\x00\x80a\xf0\xbc\xf3N\xbd\x0f\xaf\x145\xcdJ!#T\x94\x88ZD\x89\x027i\x08\x8a\x86\x16\x97lp0\x02\x87 was found 1 times\t\n'

Here is my mapper:

#!/usr/bin/env python

import sys

# Emit "<word>\t1" for every whitespace-separated token on stdin.
for line in sys.stdin:
    words = line.split()
    for word in words:
        print word + "\t" + str(1)

And my reducer:

#!/usr/bin/env python

import sys

def output(previous_key, total):
    # Emit the final count for a key once all of its values have been seen.
    if previous_key != None:
        print previous_key + " was found " + str(total) + " times"

previous_key = None
total = 0

# Hadoop streaming sorts the mapper output by key, so identical keys
# arrive contiguously and can be summed in a single pass.
for line in sys.stdin:
    key, value = line.split("\t", 1)
    if key != previous_key:
        output(previous_key, total)
        previous_key = key
        total = 0
    total += int(value)

output(previous_key, total)
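
For what it's worth, the scripts can be smoke-tested locally by simulating Hadoop's shuffle-and-sort with sort (sample.txt here stands in for any plain-text file):

cat sample.txt | python mapper.py | sort | python reducer.py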

There is nothing wrong with the input file itself. On a local machine I ran hadoop fs -text textData-00112 | less and this returned plain text from the web pages. Any input on how to run a Python streaming MapReduce job on these types of input files (Common Crawl Hadoop sequence files) is much appreciated.

Solution

You need to provide SequenceFileAsTextInputFormat as the input format to the Hadoop streaming jar. This input format converts each sequence-file key and value to text before handing them to the mapper, which is exactly what is missing here: the \x1f\x8b\x08 in your output is a gzip magic header, so your mapper was tokenizing raw, compressed record bytes.

I have never used Amazon's Elastic MapReduce, but on a normal Hadoop installation it would be done like this:

HADOOP=$HADOOP_HOME/bin/hadoop
$HADOOP jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input <input_directory> \
  -output <output_directory> \
  -mapper "mapper.py" \
  -reducer "reducer.py" \
  -inputformat SequenceFileAsTextInputFormat
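
One caveat, based on my understanding of Hadoop streaming (so treat the details as an assumption): when the input format is anything other than TextInputFormat, streaming writes each record to the mapper's stdin as key, tab, value rather than the value alone, so a mapper that splits the whole line will count the record keys as words too. A minimal sketch of a mapper that drops the key first:

#!/usr/bin/env python

import sys

for line in sys.stdin:
    # Keep only the value part of a "key<TAB>value" record, if a tab is present.
    parts = line.split("\t", 1)
    text = parts[1] if len(parts) == 2 else parts[0]
    for word in text.split():
        print word + "\t1"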

OTHER TIPS

The suggestion by Sunny Nanda fixed the issue. Adding -inputformat SequenceFileAsTextInputFormat to the extra arguments box in the AWS Elastic MapReduce API worked, and the output from the job is as expected.
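
For anyone driving Elastic MapReduce from the command line rather than the console, the same flag goes in the streaming step's argument list. A rough sketch with the current aws CLI (untested; the cluster ID is a placeholder, and the S3 paths are the ones from the question):

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
Type=STREAMING,Name='Common Crawl word count',ActionOnFailure=CONTINUE,\
Args=[-input,s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112,\
-output,s3://com.gpanterov.output/job3/,\
-mapper,s3://com.gpanterov.scripts/mapper.py,\
-reducer,s3://com.gpanterov.scripts/reducer.py,\
-inputformat,SequenceFileAsTextInputFormat]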

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow