I am writing an external script to run a mapreduce job via the Python mrjob module on my laptop (not on Amazon Elastic Compute Cloud or any large cluster).

I read from the mrjob documentation that I should use MRJob.make_runner() to run a mapreduce job from a separate python script as follows.

mr_job = MRYourJob(args=['-r', 'emr'])
with mr_job.make_runner() as runner:
    ...

However, how do I specify which input file to use? I want to use a file "datalines.txt" that sits in the same directory as my mapreduce script and the separate Python script that runs the job. Furthermore, how do I specify where the output goes?

I could not find a function in the mrjob documentation that allows me to specify these parameters.


Solution

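The mrjob Getting Started guide says that input is read from stdin or from files supplied on the command line. The args list passed to the job class plays the role of that command line, so the input path goes there: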

mr_job = MRYourJob(args=["datalines.txt"])
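
For the output half of the question, here is a minimal sketch of a driver that runs the job and collects its results in Python. It assumes mrjob 0.6 or later, where runner.cat_output() and MRJob.parse_output() are available (older releases used runner.stream_output() and parse_output_line() instead). The helper names build_job_args and run_job are made up for illustration, and the "inline" runner is chosen here because the job runs on a laptop rather than on EMR.

```python
def build_job_args(input_path, output_dir=None, runner="inline"):
    """Assemble the command-line style argument list handed to the job class.

    Hypothetical helper: it only builds the list, it does not touch mrjob.
    """
    args = ["-r", runner, input_path]
    if output_dir is not None:
        # --output-dir asks mrjob to write the final output to this directory.
        args += ["--output-dir", output_dir]
    return args


def run_job(job_class, input_path, output_dir=None):
    """Run job_class on input_path and return its output as (key, value) pairs.

    Assumes job_class is an MRJob subclass imported from its own module.
    """
    mr_job = job_class(args=build_job_args(input_path, output_dir))
    with mr_job.make_runner() as runner:
        runner.run()
        # cat_output() yields raw bytes; parse_output() decodes them with
        # the job's output protocol into (key, value) pairs.
        return list(mr_job.parse_output(runner.cat_output()))
```

From the driver script, a call such as run_job(MRYourJob, "datalines.txt") would then return the results as a list that can be printed or written to any file. Note that a relative path like "datalines.txt" is resolved against the current working directory, not the script's directory, so the driver needs to be started from that directory or given an absolute path.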