Question

I'm trying to run a MapReduce job on Hadoop using Streaming. I have two Ruby scripts, wcmapper.rb and wcreducer.rb, and I'm attempting to run the job as follows:

hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar -file wcmapper.rb -mapper wcmapper.rb -file wcreducer.rb -reducer wcreducer.rb -input test.txt -output output
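
(The scripts themselves come from the book and are not reproduced above; for illustration only, a minimal Streaming word-count pair along the following lines is assumed, with the mapper emitting "word\t1" pairs and the reducer summing the counts for each word, relying on Hadoop sorting the keys between the two stages.)

wcmapper.rb:

#!/usr/bin/env ruby
# Emit "word<TAB>1" for every whitespace-separated token on stdin
STDIN.each_line do |line|
  line.split.each { |word| puts "#{word}\t1" }
end

wcreducer.rb:

#!/usr/bin/env ruby
# Sum the counts for each word; Hadoop delivers the mapper output sorted by key
current_word = nil
current_count = 0
STDIN.each_line do |line|
  word, count = line.chomp.split("\t")
  if word == current_word
    current_count += count.to_i
  else
    puts "#{current_word}\t#{current_count}" if current_word
    current_word = word
    current_count = count.to_i
  end
end
puts "#{current_word}\t#{current_count}" if current_word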

This results in the following error message at the console:

13/11/26 12:54:07 INFO streaming.StreamJob:  map 0%  reduce 0%
13/11/26 12:54:36 INFO streaming.StreamJob:  map 100%  reduce 100%
13/11/26 12:54:36 INFO streaming.StreamJob: To kill this job, run:
13/11/26 12:54:36 INFO streaming.StreamJob: /home/paul/bin/hadoop-1.2.1/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201311261104_0009
13/11/26 12:54:36 INFO streaming.StreamJob: Tracking URL: http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201311261104_0009
13/11/26 12:54:36 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201311261104_0009_m_000000
13/11/26 12:54:36 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

Looking at the failed attempts for any of the tasks shows:

java.io.IOException: Cannot run program "/var/lib/hadoop/mapred/local/taskTracker/paul/jobcache/job_201311261104_0010/attempt_201311261104_0010_m_000001_3/work/./wcmapper.rb": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1042)

I understand that Hadoop needs to copy the mapper and reducer scripts so that all the nodes can use them, and I believe this is the purpose of the -file arguments. However, it seems the scripts are not being copied to the location where Hadoop expects to find them. The console output suggests they are being packaged:

packageJobJar: [wcmapper.rb, wcreducer.rb, /var/lib/hadoop/hadoop-unjar3547645655567272034/] [] /tmp/streamjob3978604690657430710.jar tmpDir=null

I have also tried the following:

hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar -files wcmapper.rb,wcreducer.rb -mapper wcmapper.rb -reducer wcreducer.rb -input test.txt -output output

but this gives the same error.

Can anyone tell me what the problem is?

Or where to look to better diagnose the issue?

Many thanks

Paul


Solution

Sorry, I found the answer myself.

The scripts had been downloaded as part of the Packt "Hadoop Beginner's Guide".

They originally had the shebang set as:

#!/usr/bin/env ruby

but this had generated a "file not found" error for ruby itself. Checking the details of env showed that it uses the PATH variable to determine the location of ruby. The ruby executable was in /usr/bin, and that directory was on the PATH. Nonetheless, I amended the shebang to:

#!/usr/bin/ruby

and this fixed the original "file not found" error, but produced the error shown in the question above.
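
(As an aside, and not something from the book, one quick way to confirm which interpreter actually gets picked up is to ask Ruby itself:

ruby -e 'puts RbConfig.ruby'

which prints the full path of the ruby executable that ran the command.)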

Finally, I tried running the Ruby scripts themselves at the console, which gave:

[paul@lt001 bin]$ ./wcmapper.rb 
bash: ./wcmapper.rb: /usr/bin/ruby^M: bad interpreter: No such file or directory

This seemed odd, as the executable existed in the directory shown.

I then recreated the script files by typing them in at the console. This fixed the problem: the scripts now run both at the console and in Hadoop. My assumption is that the format of the files themselves (specifically the Windows-style line endings, visible as ^M) was at fault.
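
If retyping the scripts is not practical, the line endings can also be checked and fixed with Ruby itself (a sketch, assuming ruby is on the PATH and the filenames are as above):

ruby -ne 'puts $_.inspect; exit' wcmapper.rb

prints the first line with escape characters visible, so a trailing \r\n confirms Windows-style line endings, and

ruby -i -pe 'sub(/\r$/, "")' wcmapper.rb wcreducer.rb

rewrites both files in place with the carriage returns stripped (dos2unix, where available, does the same job).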

In summary, the "file not found" error actually related to the interpreter, even though the file listed in the task log was the script file itself.

Hope that helps someone.

P

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow