Question

I am trying to use an S3 bucket as input data for my Elastic MapReduce (EMR) job flow. The S3 bucket does not belong to the same account as the EMR job flow. How and where should I specify the credentials needed to access that bucket? I tried the following format:

s3n://<Access Key>:<Secret Key>@<BUCKET>

But it gives me the following error:

Exception in thread "main" java.lang.IllegalArgumentException: The bucket name parameter must be specified when listing objects in a bucket
at com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2381)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:444)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:785)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.ensureBucketExists(Jets3tNativeFileSystemStore.java:80)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy1.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:512)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1413)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:68)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1431)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:256)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:352)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:321)
at com.inmobi.appengage.emr.mapreduce.TestSession.main(TestSession.java:88)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)

How should I specify these credentials so the job flow can read from the other account's bucket?

Solution

You should add these credentials to the core-site.xml file. You can set the S3 credentials manually on the nodes, or via a bootstrap action when launching the cluster.
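
As a minimal sketch, the core-site.xml entries could look like the fragment below. Since your input URI uses s3n://, Hadoop's NativeS3FileSystem reads the fs.s3n.* pair; the fs.s3.* pair mirrors what the bootstrap action further down sets. The placeholder values are hypothetical and should be replaced with the keys of the account that owns the bucket.

<!-- Hypothetical core-site.xml fragment: substitute the credentials
     of the account that owns the input bucket. -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>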

You can launch the cluster with something like this:

ruby elastic-mapreduce --create --alive --plain-output \
  --master-instance-type m1.xlarge --slave-instance-type m1.xlarge \
  --num-instances 11 --name "My Super Cluster" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args -c,fs.s3.awsAccessKeyId=<access-key>,-c,fs.s3.awsSecretAccessKey=<secret-key>

This should override the default credentials that EMR sets based on the account that launched the cluster.
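
Once the cluster is up, you can sanity-check the credentials from the master node with a plain Hadoop listing; the bucket and path below are placeholders for your own input location:

hadoop fs -ls s3n://<bucket>/<input-path>/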

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow