Pattern match input files for Amazon Elastic MapReduce
14-04-2022
Question
I am trying to run a MapReduce streaming job that takes input files from directories in an S3 bucket that match a given pattern. The pattern is something like bucket-name/[date]/product/logs/[hour]/[logfilename]. An example log would live at a path like bucket-name/2013-05-02/product/logs/05/log123456789.
I can get the job to work by passing only the hour portion of the path as a wildcard, for example bucket-name/2013-05-02/product/logs/*/. This successfully picks up each log file from each hour and passes them individually to mappers.
The problem comes when I try to also make the date a wildcard, for example bucket-name/*/product/logs/*/. When I do this, the job gets created but no tasks are created, and eventually it fails. This error is printed in the syslog:
2013-05-02 08:03:41,549 ERROR org.apache.hadoop.streaming.StreamJob (main): Job not successful. Error: Job initialization failed:
java.lang.OutOfMemoryError: Java heap space
at java.util.regex.Matcher.<init>(Matcher.java:207)
at java.util.regex.Pattern.matcher(Pattern.java:888)
at org.apache.hadoop.conf.Configuration.substituteVars(Configuration.java:378)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:418)
at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:523)
at org.apache.hadoop.mapred.SkipBadRecords.getMapperMaxSkipRecords(SkipBadRecords.java:247)
at org.apache.hadoop.mapred.TaskInProgress.<init>(TaskInProgress.java:146)
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:722)
at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4238)
at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2013-05-02 08:03:41,549 INFO org.apache.hadoop.streaming.StreamJob (main): killJob...
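For reference, a small Python sketch shows which of the example keys each of the two patterns above is intended to select. This uses `fnmatch`, not Hadoop's own glob matcher, and the keys are hypothetical; the patterns are extended with a trailing `*` to match full object keys directly, whereas Hadoop expands the directory patterns itself (also note that, unlike a Hadoop glob, `fnmatch`'s `*` can match across `/`):

```python
from fnmatch import fnmatch

# Hypothetical object keys, shaped like the paths described above.
keys = [
    "bucket-name/2013-05-02/product/logs/05/log123456789",
    "bucket-name/2013-05-02/product/logs/06/log987654321",
    "bucket-name/2013-05-03/product/logs/05/log111111111",
]

# The hour-only wildcard pattern and the date-plus-hour wildcard
# pattern from the question, each with a final "*" for the filename.
single = "bucket-name/2013-05-02/product/logs/*/*"
double = "bucket-name/*/product/logs/*/*"

# The hour-only pattern selects just the 2013-05-02 keys.
print([k for k in keys if fnmatch(k, single)])

# The double-wildcard pattern selects all three keys.
print([k for k in keys if fnmatch(k, double)])
```

This is only an illustration of the intended matching semantics, not a reproduction of the failure, which happens inside Hadoop's job initialization.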
Solution
On further testing, the multiple-wildcard syntax works as expected in the command-line client. I had trouble getting it to work at first, before realizing that requiring Ruby 1.8.7 means it requires exactly Ruby 1.8.7, and nothing later.