Question

I have many files stored in the distributed cache, each corresponding to a user id. I want to attach the file for a particular user id (which will be the reducer's key) to a particular reduce task. I can't do this because I read the files from the distributed cache in the configure method, which runs before the reduce method in the reduce class. The keys passed to reduce aren't available in configure, so I can't read only the file I want. Please help.

class reduce{

void configure(args)
{

/*I can read a particular file from the Path[] here.
I want to select the file corresponding to the key of the reduce method and pass its
contents to the reduce method. I am not able to do this as I can't access the key of
the reduce method.*/

}

void reduce(args)
{
}


}

Solution

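One solution is to assign the Path arrays from the DistributedCache to class fields during the configure step, as described in the DistributedCache javadocs. The reduce method can then use the key to pick the matching path from those fields and read only that file. The example below is a mapper; replace the map code with your reduce code.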

This uses the old API (org.apache.hadoop.mapred), which your code appears to be using.

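 // Nested inside your job class; the imports go at the top of that file.
 import java.io.IOException;
 import org.apache.hadoop.filecache.DistributedCache;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.MapReduceBase;
 import org.apache.hadoop.mapred.Mapper;
 import org.apache.hadoop.mapred.OutputCollector;
 import org.apache.hadoop.mapred.Reporter;

 public static class MapClass<K, V> extends MapReduceBase
 implements Mapper<K, V, K, V> {

   // Local paths of the cached archives/files, filled in once by configure()
   private Path[] localArchives;
   private Path[] localFiles;

   public void configure(JobConf job) {
     try {
       // Get the cached archives/files
       localArchives = DistributedCache.getLocalCacheArchives(job);
       localFiles = DistributedCache.getLocalCacheFiles(job);
     } catch (IOException e) {
       // configure() cannot throw checked exceptions, so rethrow unchecked
       throw new RuntimeException(e);
     }
   }

   public void map(K key, V value,
                   OutputCollector<K, V> output, Reporter reporter)
   throws IOException {
     // Use data from the cached archives/files here
     // ...
     // ...
     output.collect(key, value);
   }
 }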
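
To then pick the file for a given key inside reduce, one option is to index the cached paths by user id once and look up the entry per key. Below is a minimal sketch of just that lookup. It uses java.nio.file.Path as a stand-in for Hadoop's Path so it runs without Hadoop. The names CacheFileSelector and linesFor are hypothetical, and so is the naming convention (file name minus extension equals the user id); none of this is part of the Hadoop API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the key-to-file lookup only; java.nio.file.Path stands in
// for org.apache.hadoop.fs.Path. All names here are hypothetical.
class CacheFileSelector {

    // user id -> local file, built once (the role configure() plays)
    private final Map<String, Path> filesByUserId = new HashMap<>();

    // Assumed convention: the file name, minus any extension, is the user id.
    CacheFileSelector(Path[] localFiles) {
        for (Path file : localFiles) {
            String name = file.getFileName().toString();
            int dot = name.lastIndexOf('.');
            String userId = dot > 0 ? name.substring(0, dot) : name;
            filesByUserId.put(userId, file);
        }
    }

    // Called once per key (the role reduce() plays): reads only that user's file.
    // Returns an empty list when no file is cached for the given user id.
    List<String> linesFor(String userId) {
        Path file = filesByUserId.get(userId);
        if (file == null) {
            return Collections.emptyList();
        }
        try {
            return Files.readAllLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real reducer, you would probably build the map in configure(JobConf) from the localFiles field and call the lookup with key.toString() at the start of reduce, so each reduce call opens only the one file it needs.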
Licensed under: CC-BY-SA with attribution