Question

I want to hold File A in the memory of reducer1 and File B in the memory of reducer2. Is this possible using the Distributed Cache in Hadoop? If not, is there any other way to achieve this?

Thanks


Solution

Yes, if the files are reasonably small you can put them in the distributed cache. See this link: http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata. It might be useful to you.
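As a rough sketch of the driver-side setup (using Hadoop's old `mapred` API to match the `DistributedCache` call below; the paths and class name here are illustrative, not from the original answer), you register both files so they are shipped to every task's local disk:

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
    // Hypothetical helper: add both files to the distributed cache.
    // Every task node receives local copies of both; each reducer
    // then decides for itself which one to load into memory.
    public static void configureCache(JobConf conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/cache/fileA.txt"), conf);
        DistributedCache.addCacheFile(new URI("/cache/fileB.txt"), conf);
    }
}
```

Note that the distributed cache does not target individual reducers; it distributes the files everywhere, and the selection logic lives in the reducer itself.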

And as this portion of the code shows, it is up to you which file you work with in which reducer:

Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
if (null != cacheFiles && cacheFiles.length > 0) {
  for (Path cachePath : cacheFiles) {
    // Load only the cached file this reducer cares about.
    if (cachePath.getName().equals(stopwordCacheName)) {
      loadStopWords(cachePath);
      break;
    }
  }
}
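To make reducer1 load File A and reducer2 load File B, the reducer needs a rule mapping its own task index to a file name. A minimal sketch of such a rule, assuming you obtain the task index from the framework (e.g. via `context.getTaskAttemptID().getTaskID().getId()` in the new API; the class and method names below are illustrative):

```java
import java.util.Arrays;

public class CachePicker {
    // Hypothetical helper: given the local cache file names and this
    // reducer's task index, deterministically pick the one file this
    // reducer should hold in memory. Reducer 0 gets the first file,
    // reducer 1 the second, and so on, wrapping around.
    public static String pickCacheFile(String[] cacheFiles, int taskId) {
        // Sort for a stable order, since getLocalCacheFiles makes no
        // ordering guarantee across tasks.
        String[] sorted = cacheFiles.clone();
        Arrays.sort(sorted);
        return sorted[taskId % sorted.length];
    }

    public static void main(String[] args) {
        String[] files = {"fileA.txt", "fileB.txt"};
        System.out.println(pickCacheFile(files, 0)); // fileA.txt
        System.out.println(pickCacheFile(files, 1)); // fileB.txt
    }
}
```

Each reducer would call this with its own task index and pass the chosen path to a loader such as the `loadStopWords` method above.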

See if it helps

OTHER TIPS

The problem (and it is a fundamental problem with Hadoop) is that the task tracker launches a new JVM process for each task, negating any benefit of in-memory caching. You can configure the task tracker to reuse the same JVM, but many Hadoop internals call System.exit(), as do many Hadoop jobs, making such a configuration pointless.

You can co-deploy a GridGain cluster alongside the Hadoop cluster and use GridGain for in-memory caching via fast loopback connectivity between the two JVMs on the same physical host (that is, the GridGain JVM and Hadoop's task tracker JVM), which is something several of our clients do.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow