Question

I'm using Hadoop to compute co-occurrence similarity between words. I have a file of co-occurring word pairs that looks like this:

a b
a c
b c
b d

I'm using a graph-based approach that treats words as nodes, with an edge between every pair of co-occurring words. My algorithm needs to compute the degree of every node. I've successfully written a Map-Reduce job to compute the total degree (a sketch of a job of this shape follows the output below), which produces the following:

a 2
b 3
c 2
d 1
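
For reference, here is a minimal sketch of a degree-count job of this shape, using the org.apache.hadoop.mapreduce API; the class names are illustrative placeholders, not my actual code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DegreeJob {

    // Emits (word, 1) for each endpoint of every co-occurrence pair.
    public static class DegreeMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] pair = value.toString().split("\\s+");
            if (pair.length == 2) {
                word.set(pair[0]);
                context.write(word, ONE);
                word.set(pair[1]);
                context.write(word, ONE);
            }
        }
    }

    // Sums the 1s per word, giving the degree of each node.
    public static class DegreeReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int degree = 0;
            for (IntWritable v : values) {
                degree += v.get();
            }
            context.write(key, new IntWritable(degree));
        }
    }
}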

Currently the output is written back to a file, but what I want instead is to capture the result in, say, a java.util.HashMap. I then want to use this HashMap in another Reduce job to compute the final similarity.

Here are my questions:

  1. Is it possible to capture the results of a reduce job in memory (a List or a Map)? If so, how?
  2. Is this the best approach? If not, how should I deal with this?

Solution

There are two possibilities: either you read the data in your map/reduce task from the distributed file system, or you add it directly to the distributed cache. I just googled the distributed cache size, and it can be controlled:

"The local.cache.size parameter controls the size of the DistributedCache. By default, it’s set to 10 GB."

Link to cloudera blog

So if you add the output of your first job to the distributed cache of the second, you should be fine, I think. Tens of thousands of entries are nowhere near the gigabyte range.

Working with the distributed cache looks like this.

To read from it in your mapper:

// getLocalCacheFiles returns the local paths of all cached files
Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
String patternsFile = cacheFiles[0].toString();
BufferedReader in = new BufferedReader(new FileReader(patternsFile));
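
For your case, the setup() method of the second job's reducer could load that file straight into a HashMap. Here is a sketch assuming the first job used the default TextOutputFormat (tab-separated "word<TAB>degree" lines); the class name and key/value types are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SimilarityReducer extends Reducer<Text, Text, Text, Text> {

    // In-memory degree table, filled once per task from the cached file.
    private final Map<String, Integer> degrees = new HashMap<String, Integer>();

    @Override
    protected void setup(Context context) throws IOException {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader in = new BufferedReader(new FileReader(cacheFiles[0].toString()));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // Default TextOutputFormat separates key and value with a tab.
                String[] parts = line.split("\t");
                degrees.put(parts[0], Integer.parseInt(parts[1]));
            }
        } finally {
            in.close();
        }
    }

    // reduce(...) can now look up degrees.get(word) in memory
    // while computing the similarity scores.
}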

To add a file to the DistributedCache while setting up your second job:

DistributedCache.addCacheFile(new URI(file), job.getConfiguration()); // file: path of the first job's output
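
Putting it together, the driver of the second job could look something like this; the class name and the output path of the first job are placeholders:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class SimilarityDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "similarity");
        // Point the cache at the degree output of the first job;
        // the path below stands in for wherever that output lives.
        DistributedCache.addCacheFile(
                new URI("/user/me/degrees/part-r-00000"),
                job.getConfiguration());
        // ... set mapper/reducer classes, input/output paths, then submit.
    }
}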

Let me know if this does the trick.
