Question

I'm trying to add a job via the AWS SDK for PHP. I'm able to successfully start a cluster and start a new job flow via the API, but I'm getting an error while trying to create a Hadoop Streaming step.

Here is my code:

// add some jobflow steps
$response = $emr->add_job_flow_steps($JobFlowId, array(
    new CFStepConfig(array(
        'Name' => 'MapReduce Step 1. Test',
        'ActionOnFailure' => 'TERMINATE_JOB_FLOW',
        'HadoopJarStep' => array(
            'Jar' => '/home/hadoop/contrib/streaming/hadoop-streaming.jar',
            // ERROR IS HERE!!!! How can we pass the parameters?
            'Args' => array(
                '-input s3://logs-input/appserver1 -output s3://logs-input/job123/ -mapper s3://myscripts/mapper-apache.php -reducer s3://myscripts/reducer.php',
            ),
        )
    )),
));

I'm getting an error like: Invalid streaming parameter '-input s3://.... -output s3://..... -mapper s3://....../mapper.php -reducer s3://...../reducer.php'

So it is not clear how I can pass the arguments to the Hadoop Streaming JAR.

The official AWS SDK for PHP documentation doesn't provide any examples for this.

Possibly related unanswered thread:

Pass parameters to hive script using aws php sdk


Solution

This worked for me:

'Args' => array(
    '-input', 's3://mybucket/in/',
    '-output', 's3://mybucket/oo/',
    '-mapper', 's3://mybucket/c/mapperT1.php',
    '-reducer', 's3://mybucket/c/reducerT1.php',
)
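In other words, each streaming flag and each value has to be its own array element rather than one long string. For context, here is how that would fit into the full add_job_flow_steps call from the question (a sketch only, reusing the question's bucket and script paths; not tested by me):

// complete step definition with each flag and value as a separate Args element
$response = $emr->add_job_flow_steps($JobFlowId, array(
    new CFStepConfig(array(
        'Name' => 'MapReduce Step 1. Test',
        'ActionOnFailure' => 'TERMINATE_JOB_FLOW',
        'HadoopJarStep' => array(
            'Jar' => '/home/hadoop/contrib/streaming/hadoop-streaming.jar',
            'Args' => array(
                '-input', 's3://logs-input/appserver1',
                '-output', 's3://logs-input/job123/',
                '-mapper', 's3://myscripts/mapper-apache.php',
                '-reducer', 's3://myscripts/reducer.php',
            ),
        ),
    )),
));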

OTHER TIPS

I haven't performed these steps with the AWS SDK for PHP yet, but from other environments I'd figure that the way you specify the Amazon S3 locations might not be correct - I think they need to be as follows for your input and output parameters:

  • s3n://logs-input/appserver1
  • s3n://logs-input/job123/

Please note the use of the s3n: URI scheme instead of s3:, which might be a requirement for Amazon EMR as per the respective FAQ entry, How does Amazon Elastic MapReduce use Amazon EC2 and Amazon S3?:

Customers upload their input data and a data processing application into Amazon S3. Amazon Elastic MapReduce then launches a number of Amazon EC2 instances as specified by the customer. The service begins the job flow execution while pulling the input data from Amazon S3 using S3N protocol into the launched Amazon EC2 instances. Once the job flow is finished, Amazon Elastic MapReduce transfers the output data to Amazon S3, where customers can then retrieve it or use as input in another job flow. [emphasis mine]
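Applied to the Args array from the accepted solution, switching the URI scheme would look like this (an untested sketch; the paths are the question's placeholders):

'Args' => array(
    '-input', 's3n://logs-input/appserver1',
    '-output', 's3n://logs-input/job123/',
    '-mapper', 's3n://myscripts/mapper-apache.php',
    '-reducer', 's3n://myscripts/reducer.php',
)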


Appendix

The difference between the two URI schemes is explained in the Hadoop Wiki, see AmazonS3:

Hadoop provides two filesystems that use S3.

  • S3 Native FileSystem (URI scheme: s3n) A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3. For this reason it is not suitable as a replacement for HDFS (which has support for very large files).
  • S3 Block FileSystem (URI scheme: s3) A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.