Use Distributed Cache - HIVE STREAMING

https://stackoverflow.com/questions/19736515

03-07-2022

Question

I would like to zip the files of a Ruby gem and distribute them to my EMR cluster. I would also like to use a simple Ruby script that references the files in this gem in a Hive Streaming job.

I add both the file and the archive to the Hadoop Distributed Cache using:

ADD FILE /home/user/mobile.rb; 
ADD ARCHIVE /home/user/browser-master.zip;

Inside mobile.rb, I am using the code below to simulate using the gem:

$:.push File.expand_path("../browser-master/lib", __FILE__)
require "browser"

When I have the unzipped archive and the mobile.rb file in the same dir on my local machine, I can stream data to it and run the program just fine.

But when I add the files to my Hadoop cluster I get this error:

FAILED: Execution Error, return code 20003 from org.apache.hadoop.hive.ql.exec.MapRedTask. An error occurred when trying to close the Operator running your custom script.

Does my mobile.rb need to point to something else when the archive is unzipped in the Distributed Cache?

I am using Hive 0.11.

Solution

After doing some testing, adding the entire directory (unzipped) using ADD FILE seemed to work:

ADD FILE /home/user/browser-master;
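
Presumably this works because the unzipped directory then sits next to mobile.rb, just as it does on the local machine, so the relative path in the script still resolves. If shipping the zipped archive is preferred, the script may need to look elsewhere: the Hadoop distributed cache typically unpacks an archive into a directory named after the archive file itself (here browser-master.zip), though this was not verified on Hive 0.11. The Ruby below is only a sketch under that assumption. The helper names candidate_lib_dirs and add_existing_lib_dirs are hypothetical and are not part of the browser gem or of Hive.

```ruby
# Hypothetical helper: list the places where the gem's lib/ folder might be,
# relative to the directory that contains the streaming script.
def candidate_lib_dirs(script_dir)
  [
    # Layout from the question: unzipped directory next to the script.
    File.join(script_dir, "browser-master", "lib"),
    # Assumed layouts when the archive is unpacked under a directory
    # named after the zip file (an assumption, not verified on Hive 0.11).
    File.join(script_dir, "browser-master.zip", "browser-master", "lib"),
    File.join(script_dir, "browser-master.zip", "lib"),
  ]
end

# Hypothetical helper: prepend every candidate that exists to the load path
# and return the directories that were found.
def add_existing_lib_dirs(script_dir, load_path = $LOAD_PATH)
  found = candidate_lib_dirs(script_dir).select { |dir| File.directory?(dir) }
  found.each { |dir| load_path.unshift(dir) unless load_path.include?(dir) }
  if found.empty?
    # Anything written to stderr should show up in the task logs, which
    # helps when the only visible error is the generic return code.
    entries = File.directory?(script_dir) ? Dir.entries(script_dir).inspect : "(missing)"
    warn "no gem lib dir found under #{script_dir}; entries: #{entries}"
  end
  found
end
```

In mobile.rb this could replace the hard-coded load path line: call add_existing_lib_dirs(File.dirname(File.expand_path(__FILE__))) and then require "browser" as before. If nothing is found, the directory listing written to stderr shows how the files were actually laid out on the node.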

Licensed under: CC-BY-SA with attribution
