I am trying to run a simple word-count MapReduce job on Amazon Elastic MapReduce, but the output is gibberish. The input file is part of the Common Crawl dataset, which is stored as Hadoop SequenceFiles. The file is supposed to contain the extracted text (stripped of HTML) from the crawled web pages.
My AWS Elastic MapReduce step looks like this:
Mapper: s3://com.gpanterov.scripts/mapper.py
Reducer: s3://com.gpanterov.scripts/reducer.py
Input S3 location: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112
Output S3 location: s3://com.gpanterov.output/job3/
The job runs successfully, but the output is gibberish: only strange symbols and no words at all. My guess is that Hadoop SequenceFiles cannot be read directly through standard input. But then how do you run a streaming MapReduce job on such a file? Do the SequenceFiles have to be converted to plain text files first?
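For reference, I believe the EMR step above corresponds to a Hadoop Streaming invocation roughly like the one below. The -inputformat flag is my guess at what might be missing: without it, streaming defaults to TextInputFormat, which would feed the raw SequenceFile bytes to the mapper as if they were lines of text. SequenceFileAsTextInputFormat ships with Hadoop and converts each key and value to its text representation first.

```shell
# Hypothetical streaming invocation (paths copied from the EMR step above;
# jar name and the -inputformat line are my assumptions, not tested).
hadoop jar hadoop-streaming.jar \
  -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
  -input  s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112 \
  -output s3://com.gpanterov.output/job3/ \
  -mapper  s3://com.gpanterov.scripts/mapper.py \
  -reducer s3://com.gpanterov.scripts/reducer.py
```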
The first couple of lines from part-00000 look like this:
'\x00\x00\x87\xa0 was found 1 times\t\n'
'\x00\x00\x8e\x01:\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x05\xc1=K\x02\x01\x00\x80a\xf0\xbc\xf3N\xbd\x0f\xaf\x145\xcdJ!#T\x94\x88ZD\x89\x027i\x08\x8a\x86\x16\x97lp0\x02\x87 was found 1 times\t\n'
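If I am reading the escapes right, the second record even contains the byte sequence \x1f\x8b\x08, which is the gzip magic number, so the mapper seems to be receiving raw (possibly compressed) binary values rather than text. A quick local sanity check of that guess:

```python
# Check whether the garbled output contains the gzip magic number.
# The byte string below is the start of the second output line above.
record = b'\x00\x00\x8e\x01:\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

GZIP_MAGIC = b'\x1f\x8b\x08'  # gzip header: ID1=0x1f, ID2=0x8b, CM=8 (deflate)

print(GZIP_MAGIC in record)  # True -> the value looks gzip-compressed
```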
Here is my mapper:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    words = line.split()
    for word in words:
        print word + "\t" + str(1)
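To rule out the mapper itself, I tested the same logic locally (pulled into a function, and in Python 3 syntax purely for the test):

```python
def map_words(line):
    """Emit tab-separated (word, 1) pairs for one input line."""
    return [word + "\t1" for word in line.split()]

# On plain text the mapper behaves as expected:
print(map_words("the quick brown fox the"))
# -> ['the\t1', 'quick\t1', 'brown\t1', 'fox\t1', 'the\t1']
```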
And my reducer:
#!/usr/bin/env python
import sys

def output(previous_key, total):
    if previous_key != None:
        print previous_key + " was found " + str(total) + " times"

previous_key = None
total = 0

for line in sys.stdin:
    key, value = line.split("\t", 1)
    if key != previous_key:
        output(previous_key, total)
        previous_key = key
        total = 0
    total += int(value)

output(previous_key, total)
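Running the two scripts' logic locally over plain text (simulating the shuffle with a sort, again in Python 3 syntax just for the test) produces sensible counts, which is another reason I suspect the input format rather than the scripts:

```python
def map_words(line):
    # Mapper logic: emit "word\t1" per word.
    return [word + "\t1" for word in line.split()]

def reduce_pairs(sorted_pairs):
    # Reducer logic: sum consecutive counts for each key.
    results = []
    previous_key, total = None, 0
    for pair in sorted_pairs:
        key, value = pair.split("\t", 1)
        if key != previous_key:
            if previous_key is not None:
                results.append(previous_key + " was found " + str(total) + " times")
            previous_key = key
            total = 0
        total += int(value)
    if previous_key is not None:
        results.append(previous_key + " was found " + str(total) + " times")
    return results

lines = ["the cat sat", "the cat ran"]
pairs = sorted(p for line in lines for p in map_words(line))
print(reduce_pairs(pairs))
# -> ['cat was found 2 times', 'ran was found 1 times',
#     'sat was found 1 times', 'the was found 2 times']
```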
There is nothing wrong with the input file. On a local machine I ran

hadoop fs -text textData-00112 | less

and this returns pure text from the web pages.
Any input on how to run a Python streaming MapReduce job on these kinds of input files (Common Crawl Hadoop SequenceFiles) would be much appreciated.