The DistributedCache doesn't support lazy loading: the files are copied down to each task node before the first map / reduce task of your job executes on that node (note that files are only copied to nodes where a map / reduce task will actually run). If you want lazy loading, open the files directly from HDFS instead, although this doesn't scale well for your namenode / datanodes if you have 1000's of concurrent tasks trying to read from the same file.
You can use symbolic linking to give files in the distributed cache friendly names, and they will appear in the local working directory (symlinked) for each map / reduce task.
For example, using the GenericOptionsParser's -files option, you can upload a file to HDFS, add it to the DistributedCache and assign it a friendly name as follows:
hadoop jar myjar.jar MainClass -files ref-map.txt#map1.txt ...
Now you should be able to open the file via its friendly name just by calling the following in your map / reduce task:
File map1 = new File("map1.txt");
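Typically you'd open that symlink once in the task's setup() and build an in-memory lookup, rather than re-reading it for every record. A minimal sketch of the parsing step, assuming a tab-separated key/value file (the RefMapLoader class name and the file format are my assumptions, not from your code); in a real Mapper you'd call it from setup() with new FileReader("map1.txt"):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: loads a tab-separated reference file (e.g. the
// symlinked "map1.txt") into an in-memory lookup map.
public class RefMapLoader {
    public static Map<String, String> load(Reader in) throws IOException {
        Map<String, String> map = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // split into key and value on the first tab only
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    map.put(parts[0], parts[1]);
                }
            }
        }
        return map;
    }
}
```

In your Mapper, calling RefMapLoader.load(new FileReader("map1.txt")) from setup() means the file is parsed once per task attempt instead of once per input record.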
If your files are already in HDFS, then just add them as normal and call the DistributedCache.createSymlink(Configuration)
method. You can also assign friendly names by using fragment URIs when you add files to the distributed cache:
DistributedCache.addCacheFile(new URI("/path/to/file.txt#file1"), conf);