Pattern match input files for Amazon Elastic MapReduce

https://stackoverflow.com/questions/16344143

14-04-2022

Question

I am trying to run a MapReduce streaming job that takes its input files from directories in an S3 bucket that match a given pattern. The pattern is something like bucket-name/[date]/product/logs/[hour]/[logfilename]. An example log would be in a file like bucket-name/2013-05-02/product/logs/05/log123456789.

I can get the job to work by making only the hour portion of the path a wildcard, for example: bucket-name/2013-05-02/product/logs/*/. This successfully picks up each log file from each hour and passes them individually to mappers.
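To make the intended matching concrete, here is a minimal Python sketch of how a directory pattern with a wildcard selects log files. It is only an illustration, not Hadoop's or EMR's actual glob implementation: it handles just `*` (assumed to match within a single path segment), and the helper names `glob_to_regex` and `matching_keys`, as well as every sample key except the first, are hypothetical.

```python
import re

def glob_to_regex(pattern):
    """Translate a path glob into a regex; '*' is assumed to stay within one path segment."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")  # wildcard does not cross a '/' in this sketch
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def matching_keys(dir_pattern, keys):
    """Return the keys whose parent directory matches the directory pattern."""
    regex = glob_to_regex(dir_pattern)
    selected = []
    for key in keys:
        parent = key.rsplit("/", 1)[0] + "/"  # directory part, with trailing slash
        if regex.fullmatch(parent):
            selected.append(key)
    return selected

# The first key is the example from the question; the others are made up for illustration.
SAMPLE_KEYS = [
    "bucket-name/2013-05-02/product/logs/05/log123456789",
    "bucket-name/2013-05-02/product/logs/06/log123456790",
    "bucket-name/2013-05-03/product/logs/05/log123456791",
    "bucket-name/2013-05-02/other/logs/05/log123456792",
]

HOUR_WILDCARD = "bucket-name/2013-05-02/product/logs/*/"

for key in matching_keys(HOUR_WILDCARD, SAMPLE_KEYS):
    print(key)
```

With the hour wildcard, only the two keys under the 2013-05-02 product logs are selected; the key from a different date and the key under a different product directory are skipped.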

The problem comes when I also try to make the date a wildcard, for example: bucket-name/*/product/logs/*/. When I do this, the job is created, but no tasks are created and it eventually fails. The following error is printed in the syslog:

2013-05-02 08:03:41,549 ERROR org.apache.hadoop.streaming.StreamJob (main): Job not successful. Error: Job initialization failed:
java.lang.OutOfMemoryError: Java heap space
    at java.util.regex.Matcher.<init>(Matcher.java:207)
    at java.util.regex.Pattern.matcher(Pattern.java:888)
    at org.apache.hadoop.conf.Configuration.substituteVars(Configuration.java:378)
    at org.apache.hadoop.conf.Configuration.get(Configuration.java:418)
    at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:523)
    at org.apache.hadoop.mapred.SkipBadRecords.getMapperMaxSkipRecords(SkipBadRecords.java:247)
    at org.apache.hadoop.mapred.TaskInProgress.<init>(TaskInProgress.java:146)
    at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:722)
    at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4238)
    at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

2013-05-02 08:03:41,549 INFO org.apache.hadoop.streaming.StreamJob (main): killJob...

Solution

On further testing, it looks like the multiple-wildcard syntax works as expected in the command line client. I had trouble getting the client to work at first, before realizing that its Ruby 1.8.7 requirement means exactly Ruby 1.8.7, and nothing later.

Licensed under: CC-BY-SA with attribution