Question

I'm trying to do some data analysis on Amazon Elastic MapReduce (EMR). The mapper step is a Python script that calls a compiled C++ binary, "./formatData". For example:

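# myMapper.py
import sys
from subprocess import Popen, PIPE

inputData = sys.stdin.readline()
# ...
# Pipe the input line through the compiled binary in the working directory.
# universal_newlines=True keeps the pipes in text mode on Python 2 and 3.
p1 = Popen('./formatData', stdin=PIPE, stdout=PIPE, universal_newlines=True)
# communicate() returns (stdout, stderr); keep only stdout
p1Output = p1.communicate(input=inputData)[0]
result = ... # manipulate the formatted data
print("%s\t%s" % (result, 1))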

Can I call a binary executable like this on Amazon EMR? If so, where would I store the binary (in S3?), for what platform should I compile it, and how do I ensure my mapper script has access to it (ideally it would be in the current working directory)?

Thanks!

Solution

You can call the binary that way, as long as the binary is copied to the worker nodes correctly.

See https://forums.aws.amazon.com/thread.jspa?threadID=35158 for an explanation of how to use the distributed cache to make binary files accessible on the worker nodes.
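
To make the mapper side concrete, here is a minimal sketch, not a definitive implementation. It assumes the binary has already been shipped to each task's working directory under the name formatData. One way to do that is Hadoop streaming's -cacheFile option (or the newer -files option), which takes a URI ending in a #name fragment and symlinks the file under that name; a value such as s3n://mybucket/bin/formatData#formatData would be typical, where the bucket and path are made up for illustration. The helper names format_with_binary and format_record are hypothetical, not part of any library.

```python
import subprocess
import sys


def format_with_binary(text, command=("./formatData",)):
    """Pipe `text` through an external command and return what it prints.

    `command` defaults to the binary from the question, assumed to be
    present in the task's current working directory.
    """
    proc = subprocess.Popen(
        list(command),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text-mode pipes on Python 2 and 3
    )
    out, _ = proc.communicate(input=text)
    if proc.returncode != 0:
        # Fail loudly so the task is marked as failed rather than
        # silently emitting nothing.
        raise RuntimeError(
            "%s exited with status %d" % (command[0], proc.returncode)
        )
    return out


def format_record(key, count=1):
    """Build one tab-separated key/value line for Hadoop streaming."""
    return "%s\t%s" % (key, count)
```

In the mapper you would then call something like print(format_record(format_with_binary(line))) for each input line. Raising on a non-zero exit status is a design choice: a missing or wrongly compiled binary then shows up as a failed task in the job logs instead of as empty output.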

Licensed under: CC-BY-SA with attribution