Question

I would like a specific subset of records to be merged with each chunk of records at each mapper. How can I do this in Hadoop generally, and in the Python streaming package mrJob?

Solution

DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars, etc.) needed by applications.

Applications specify the files to be cached via URLs (hdfs:// or http://) in the JobConf. The DistributedCache assumes that files specified via hdfs:// URLs are already present on the FileSystem at the path specified by the URL.

The framework will copy the necessary files onto the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are copied only once per job and from the ability to cache archives, which are un-archived on the slaves.

DistributedCache can be used to distribute simple, read-only data/text files and/or more complex types such as archives and jars. Archives (zip, tar, and tgz/tar.gz files) are un-archived at the slave nodes. Jars may optionally be added to the classpath of the tasks, a rudimentary software distribution mechanism. Files have execution permissions. Optionally, users can also direct it to symlink the distributed cache file(s) into the working directory of the task.

DistributedCache tracks modification timestamps of the cache files. Clearly the cache files should not be modified by the application or externally while the job is executing.
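
To make this concrete, here is a minimal sketch of a Hadoop Streaming mapper in Python that merges a cached subset of records with the records each mapper receives. The file name subset.txt and the tab-separated, key-first record layout are assumptions for illustration; the file would be shipped with -files (or -cacheFile on older Hadoop versions), which symlinks it into the task's working directory:

    #!/usr/bin/env python
    # mapper.py -- a hedged sketch of a Hadoop Streaming mapper that
    # merges each input record with a cached subset of records.
    # Assumes the job was launched with something like:
    #   -files hdfs:///path/subset.txt   (hypothetical path)
    # so subset.txt is symlinked into the task's working directory.
    import sys

    # Load the cached side data once, before reading the input stream.
    subset = {}
    with open('subset.txt') as f:
        for line in f:
            key, _, value = line.rstrip('\n').partition('\t')
            subset[key] = value

    for line in sys.stdin:
        key, _, value = line.rstrip('\n').partition('\t')
        if key in subset:
            # Emit the input record merged with the matching cached record.
            print('%s\t%s\t%s' % (key, value, subset[key]))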

For Python mrJob:

I think you have to use

    mrjob.compat.supports_new_distributed_cache_options(version)

and then use -files and -archives instead of -cacheFile and -cacheArchive.
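
For example (a hedged sketch: the version string and HDFS path are placeholders, and the helper is used here only as the version check it is named for):

    # Pick the right streaming flag for the Hadoop version in use.
    from mrjob.compat import supports_new_distributed_cache_options

    if supports_new_distributed_cache_options('1.0.3'):
        # Newer Hadoop: -files copies the file and symlinks it for each task.
        cache_args = ['-files', 'hdfs:///path/subset.txt']
    else:
        # Older Hadoop: -cacheFile needs an explicit #linkname fragment.
        cache_args = ['-cacheFile', 'hdfs:///path/subset.txt#subset.txt']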

Maybe you will find more details in the mrjob documentation.
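
Concretely, a minimal mrjob job might look like the following (a hedged sketch: subset.txt, the job name, and the record layout are assumptions; mrjob ships a file passed with --file to the distributed cache and symlinks it into each task's working directory):

    # my_job.py
    from mrjob.job import MRJob

    class MRMergeWithSubset(MRJob):

        def mapper_init(self):
            # Load the cached subset once per mapper; the framework has
            # symlinked it into the task's working directory.
            self.subset = {}
            with open('subset.txt') as f:
                for line in f:
                    key, _, value = line.rstrip('\n').partition('\t')
                    self.subset[key] = value

        def mapper(self, _, line):
            key, _, value = line.rstrip('\n').partition('\t')
            if key in self.subset:
                # Merge the input record with the matching cached record.
                yield key, (value, self.subset[key])

    if __name__ == '__main__':
        MRMergeWithSubset.run()

You would then run it with something like: python my_job.py -r hadoop --file subset.txt hdfs:///input/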

OTHER TIPS

Not sure what exactly you're trying to do, but maybe you can use the Distributed Cache feature to achieve this.

Sample use case for Distributed Cache:

Input to mapper: customer reviews.
You want to process only those reviews which contain certain keywords, which are stored in a words.txt file.
You can put words.txt into the Distributed Cache, which makes it available to the mappers and reducers.

Not sure how exactly it is done for Python streaming, but it should not be difficult to figure out; a rough sketch follows.
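
Here is a hedged sketch of that use case as a plain Hadoop Streaming mapper in Python, assuming words.txt was shipped via the distributed cache and is therefore readable from the task's working directory:

    #!/usr/bin/env python
    # filter_mapper.py -- keep only reviews containing a cached keyword.
    import sys

    # Load the keyword list from the cached words.txt once.
    with open('words.txt') as f:
        keywords = set(line.strip() for line in f if line.strip())

    for line in sys.stdin:
        review = line.rstrip('\n')
        if any(word in review for word in keywords):
            print(review)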

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow