Question

Given a hadoop cluster, I have a job for which I have a large set of files that need to be accessed by all workers while they perform their reduce stage.

It seems it would be a good idea to use the facilities of DistributedCache. However, it appears it does not satisfy the following desired behavior:

  • Lazy file fetching: files are copied to the workers lazily (only when attempted to be read are they cached locally).

  • getLocalCacheFiles is weird: another obviously related problem is that of the DistributedCache interface. to access the local files it seems, one needs to call DistributedCache.getLocalCacheFiles(conf). Is there a way to just request a certain file by name (ex: DistributedCache.getLocalFile(conf, fileName))

Can DistributedCache do this? Is there any other library that satisfies my requirements?

Thank you!

Was it helpful?

Solution

The distributed cache doesn't support lazy loading, they are copied down to each task node prior to the first map / reduce task for your job being executed on that node (note the files are only copied to nodes where a map / reduce task will take place). If you want lazy loading, just open the files directly in HDFS, although this doesn't scale too well for you namenode / datanodes if you have 1000's of concurrent tasks trying to read from the same file

You can use symbolic linking to give files in the distributed cache friendly names, and they will appear in the local working directory (symlinked) for each map / reduce task.

For example, with the generic options parser options -files, you can upload a file to HDFS, add it to the DistributedCache and assign a friendly name as follows:

hadoop jar myjar.jar MainClass -files ref-map.txt#map1.txt ...

Now you should be able to open the ref-map.txt file just by calling the following in your map / reducer:

File map1 = new File("map1.txt");

If you're files are already in HDFS, then just add then as normal, and call the createSymlink(Configuration) method. You can also assign friendly nanes by using fragment URIs when you add files to the distibuted cache:

DistributedCache.addCacheFile(new URI("/path/to/file.txt#file1", conf);
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top