Question

I am running a Python MapReduce script on top of Amazon's EMR Hadoop implementation. The output of the main scripts is a set of item-item similarities. In a post-processing step, I want to split this output into a separate S3 bucket for each item, so that each item bucket contains a list of the items similar to it. To do this, I want to use Amazon's boto Python library in the reduce function of the post-processing step, roughly along the lines of the sketch below.

  • How do I make external Python libraries available to Hadoop so that they can be used in a reduce step written in Python?
  • Is it possible to access S3 that way from inside the Hadoop environment?
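
For context, this is roughly what I have in mind for the reducer of the post-processing step. It is only a minimal sketch: the bucket name and key layout are placeholders, credentials are assumed to come from the environment, and I assume the reducer receives tab-separated "item, similar item, score" lines grouped by item.

import sys

import boto
from boto.s3.key import Key

def flush(bucket, item_id, similar_items):
    # one S3 object per item, holding that item's list of similar items
    key = Key(bucket)
    key.key = 'similarities/%s' % item_id               # placeholder key layout
    key.set_contents_from_string('\n'.join(similar_items))

def main():
    # credentials are taken from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY here
    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-item-similarities')    # placeholder bucket name

    current_item, similar_items = None, []
    for line in sys.stdin:
        # expected reducer input: item_id <TAB> similar_item <TAB> score
        item_id, similar_item, score = line.rstrip('\n').split('\t')
        if current_item is not None and item_id != current_item:
            flush(bucket, current_item, similar_items)
            similar_items = []
        current_item = item_id
        similar_items.append(similar_item)
    if current_item is not None:
        flush(bucket, current_item, similar_items)

if __name__ == '__main__':
    main()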

Thanks in advance, Thomas


Solution

When launching a Hadoop job, you can specify external files that should be made available to the tasks. This is done with the -files argument.

$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat
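
For a Python streaming job the idea is the same: the reducer script and any extra modules it imports can be shipped with the job. A minimal sketch, assuming the streaming jar location, script names, and S3 paths (older streaming releases use -file instead of the generic -files option):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -files similarity_reducer.py,mylib.py \
    -mapper cat \
    -reducer similarity_reducer.py \
    -input s3://my-bucket/similarities/ \
    -output s3://my-bucket/split-output/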

I don't know whether the files HAVE to be on HDFS, but if it's a job that will be running often, it wouldn't be a bad idea to put them there.
From the code you can do something similar to this:

import java.io.File;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
    List<Path> localFiles = Arrays.asList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
    for (Path localFile : localFiles) {
        if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
            // local, task-side copy of the file shipped with -files
            File geoIpFile = new File(localFile.toUri().getPath());
        }
    }
}

This is more or less copied and pasted directly from working code in several of our Mappers.
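
In a Python streaming task the lookup is simpler, because files shipped with -files (or streaming's -file) are symlinked into the task's working directory. A minimal sketch, with placeholder file and module names:

import os
import sys

# modules shipped alongside the script land in the working directory,
# so put it on the import path before importing them
sys.path.insert(0, os.getcwd())
import mylib  # hypothetical helper module shipped with -files

# distributed files can be opened by name from the working directory
with open('GeoIPCity.dat', 'rb') as f:
    geo_data = f.read()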

I don't know about the second part of your question. Hopefully the answer to the first part will get you started. :)

In addition to -files there is -libjars for including additional jars; I have a little more information about that here: If I have a constructor that requires a path to a file, how can I "fake" that if it is packaged into a jar?

Licensed under: CC-BY-SA with attribution