Question

I want to hold File A in the memory of reducer1 and File B in the memory of reducer2. Is this possible using the Distributed Cache technology in Hadoop? Or is there another way to achieve this?

Thanks


Solution

Yes, if the files are reasonably small you can place them in the distributed cache. Follow this link http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata. It might be useful to you.

And if you look at this portion of the code, it is up to you which file you want to work on in which reducer.

// Retrieve the files placed in the distributed cache and pick out,
// by name, the one this reducer should load into memory.
Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
if (null != cacheFiles && cacheFiles.length > 0) {
  for (Path cachePath : cacheFiles) {
    if (cachePath.getName().equals(stopwordCacheName)) {
      loadStopWords(cachePath);
      break;
    }
  }
}

See if this helps.
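The snippet above selects a cached file by comparing names. A minimal, Hadoop-free sketch of the selection logic each reducer could use is shown below; the file names fileA.txt/fileB.txt and the partition-to-file assignment are assumptions for illustration (in a real reducer you would read the partition number from the job configuration, e.g. mapred.task.partition in the old API, and then load the matching path from getLocalCacheFiles):

```java
import java.util.HashMap;
import java.util.Map;

public class CacheSelector {
    // Hypothetical assignment: reducer partition 0 loads fileA.txt,
    // partition 1 loads fileB.txt. Any other partition loads nothing.
    static String fileForPartition(int partition) {
        Map<Integer, String> assignment = new HashMap<>();
        assignment.put(0, "fileA.txt");
        assignment.put(1, "fileB.txt");
        return assignment.getOrDefault(partition, null);
    }

    public static void main(String[] args) {
        // Each reducer would call this once in setup/configure and
        // then load only its own file into memory.
        System.out.println(fileForPartition(0)); // fileA.txt
        System.out.println(fileForPartition(1)); // fileB.txt
    }
}
```

In the actual reducer you would replace the printouts with a call that parses the chosen cached file into an in-memory structure, exactly as loadStopWords does in the snippet above.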

Other tips

The problem (and it is a fundamental problem with Hadoop) is that the task tracker launches a new JVM process for each task, negating any benefit of in-memory caching. You can configure the task tracker to reuse the same JVM, but many Hadoop internals (and therefore many Hadoop jobs) call System.exit(), making such a configuration pointless.
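For reference, the JVM reuse mentioned above is controlled in Hadoop 1.x by the mapred.job.reuse.jvm.num.tasks property in mapred-site.xml; a value of -1 means unlimited reuse, but note that the JVM is only reused across tasks of the same job, and the System.exit() caveat above still applies:

```xml
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <!-- -1 = reuse the JVM for an unlimited number of tasks of the same job -->
  <value>-1</value>
</property>
```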

You can co-deploy a GridGain cluster alongside the Hadoop cluster and use GridGain for in-memory caching via fast loopback connectivity between the two JVMs on the same physical host (the GridGain JVM and Hadoop's task tracker JVM); several of our clients do this.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow