Question

I'm writing a simple bigram (word pair) count. I started with the simple "pairs" approach and have now modified it to try the "stripes" approach, but in the mapper's cleanup routine all of my keys somehow end up as the same word pair (the last word pair!), while the counts stay as expected.

e.g. text input is:

My name is foo. Foo is new to Hadoop.

In the mapper, my HashMap looks like:

((my, name), 1), ((name, is), 1), ((is, foo), 2), ((is, new), 1), ((new, to), 1), ((to, hadoop), 1)

But when I print the same HashMap in the cleanup routine, it looks like:

((to, hadoop), 1), ((to, hadoop), 1), ((to, hadoop), 2), ((to, hadoop), 1), ((to, hadoop), 1), ((to, hadoop), 1)

My code looks like:

Map Class:
private HashMap<TextPair, Integer> h = new HashMap<TextPair, Integer>();

void map(...):
    ...
    StringTokenizer itr = new StringTokenizer(value.toString());
    left = itr.nextToken();
    while (itr.hasMoreTokens()) {
        right = itr.nextToken();

        if (left != null && right != null) {
            // I have to create a new TextPair (key object) each time!
            key.set(new Text(left.toLowerCase()), new Text(right.toLowerCase()));
            // If key is there, just do count + 1, else add key with value 1
            if (h.containsKey(key)) {
                int total = h.get(key) + 1;
                h.put(key, total);
            } else {
                System.out.println("key: " + key.toString() + " => 1");
                h.put(key, 1);
            }
            //context.write(key, one);
        }
        left = right;
    }
    ....

void cleanup(...):
    Iterator<Entry<TextPair, Integer>> itr = h.entrySet().iterator();
    while (itr.hasNext()) {
        Entry<TextPair, Integer> entry = itr.next();
        TextPair key = entry.getKey();
        int total = entry.getValue().intValue();
        System.out.println("--- MAP CLEANUP ---: key: " + key.toString() + " => Total: " + total);

        context.write(key, new IntWritable(total));
    }
...

Note: TextPair is my custom key class. Any suggestions?

EDIT 1:

Is the mapper's cleanup routine executed once at the end, after all the map() calls are done? The HashMap is effectively "global" to the task, so is something wrong with that, or with my iterator?

EDIT 2:

I have to create a new TextPair key object on each iteration in map() before putting it into the HashMap; that's what the issue was. It's solved, but I'm wondering why. I've used hashes (dicts) in Python many times with no pain, so why do I need to create a new object each time here? I don't understand.

Was it helpful?

Solution

It seems that you don't create a new key each time but reuse the same object. The counts are therefore distributed the same way in both printouts, but every entry in the HashMap ends up referencing that one key object, so the last pair written into it appears everywhere in the second printout.
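The symptom can be reproduced outside Hadoop. Below is a minimal sketch using a hypothetical mutable Pair class standing in for TextPair (not the original code): reusing one mutable key object makes every entry print as the last pair, while the counts stay separate.

```java
import java.util.HashMap;
import java.util.Map;

// A hypothetical mutable Pair class standing in for TextPair
// (not the original code), to reproduce the symptom outside Hadoop.
public class MutableKeyDemo {

    static class Pair {
        String left, right;

        void set(String l, String r) { left = l; right = r; }

        @Override public int hashCode() { return 31 * left.hashCode() + right.hashCode(); }

        @Override public boolean equals(Object o) {
            return o instanceof Pair && left.equals(((Pair) o).left)
                                     && right.equals(((Pair) o).right);
        }

        @Override public String toString() { return "(" + left + ", " + right + ")"; }
    }

    public static void main(String[] args) {
        Map<Pair, Integer> h = new HashMap<>();
        Pair key = new Pair();      // one shared key object, as in the question

        key.set("my", "name");
        h.put(key, 1);              // the map stores a reference to 'key', not a copy

        key.set("name", "is");
        h.put(key, 1);              // different hash, so a second entry is added --
                                    // but the first entry's key is this same object
                                    // and changed along with it

        // Both entries now print as (name, is) => 1.
        for (Map.Entry<Pair, Integer> e : h.entrySet()) {
            System.out.println(e.getKey() + " => " + e.getValue());
        }
    }
}
```

The map ends up with two entries (the counts), but both of their key slots point at the single shared Pair, which is exactly the pattern in the cleanup printout above.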

OTHER TIPS

I assume you are using the TextPair class example from 'Hadoop: The Definitive Guide'?

The problem is that it isn't safe to use a mutable object as a key in a HashMap. Instead, you should use immutable keys, such as primitives or String. Since the TextPair example class from the book is mutable, problems can arise when putting, getting, or removing values through such key objects: the map stores a reference to the key (not a copy), and the entry stays filed under the hash code the key had at insertion time.
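The same hazard can be shown with a stock mutable class from the JDK, with no custom code at all. A small sketch, using an ArrayList as a stand-in mutable key: mutating the key after insertion strands the entry under its old hash code, so lookups silently fail.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// ArrayList has content-based hashCode/equals, like TextPair,
// and is mutable -- which makes it just as unsafe as a map key.
public class MutableKeyLookup {
    public static void main(String[] args) {
        Map<List<String>, Integer> h = new HashMap<>();
        List<String> key = new ArrayList<>(List.of("my", "name"));
        h.put(key, 1);              // entry is filed under the hash of (my, name)

        key.set(0, "to");
        key.set(1, "hadoop");       // key mutated in place after insertion

        // The entry is stranded under its old hash code, so neither the
        // old contents nor the new ones can be looked up any more.
        System.out.println(h.containsKey(List.of("my", "name"))); // false
        System.out.println(h.containsKey(key));                   // false
        System.out.println(h.size());                             // still 1
    }
}
```

This is the quieter form of the bug: instead of visibly duplicated keys, `containsKey` in the map() loop simply stops finding existing entries, so counts never accumulate.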

One way to work around this problem is to create a new TextPair object every time, as you already did. Another way to solve it is to use the java.util.AbstractMap.SimpleImmutableEntry class as the key.
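Here is a sketch of what the bigram count might look like with SimpleImmutableEntry keys. It makes assumptions the original snippet doesn't show: a plain method instead of a real Mapper, and a hypothetical regex normalizer that strips punctuation (the sample output suggests this happens somewhere).

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class BigramCount {

    // Hypothetical normalizer: the original code only lowercases, but the
    // sample output suggests punctuation is stripped somewhere as well.
    static String norm(String token) {
        return token.toLowerCase().replaceAll("[^a-z]", "");
    }

    static Map<SimpleImmutableEntry<String, String>, Integer> count(String text) {
        Map<SimpleImmutableEntry<String, String>, Integer> h = new HashMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        String left = itr.hasMoreTokens() ? norm(itr.nextToken()) : null;
        while (itr.hasMoreTokens()) {
            String right = norm(itr.nextToken());
            // A fresh immutable key per pair: nothing can mutate it after
            // it goes into the map, so the entries stay distinct.
            SimpleImmutableEntry<String, String> key =
                    new SimpleImmutableEntry<>(left, right);
            h.merge(key, 1, Integer::sum);
            left = right;
        }
        return h;
    }

    public static void main(String[] args) {
        count("My name is foo. Foo is new to Hadoop.").forEach((k, v) ->
                System.out.println("(" + k.getKey() + ", " + k.getValue() + ") => " + v));
    }
}
```

In a real mapper, cleanup() would wrap each entry back into TextPair/IntWritable before context.write(); the point is only that the key stored in the HashMap is never mutated.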

I encountered the same problem you have and solved it by implementing a version that uses SimpleImmutableEntry keys.

Licensed under: CC-BY-SA with attribution