Question

I am new to MapReduce and would like your opinion on the best MapReduce approach for the following task.

I have a single large document in the format

1 2 3
2
2 3 4 5

Each line contains a list of numbers. I want to list every possible pair combination of numbers appearing on any line, together with the number of lines that contain each given pair.

The result will look something like this:

element1 element2 occurrences
1        1        1
1        2        1
1        3        1
2        2        3
2        3        2
3        3        2
3        4        1
3        5        1

There are about 2M lines in the document and about 1.5M distinct numbers, and there will be about 2.5G distinct pairs to be counted.

The straightforward pseudocode looks like this (map is invoked once for each line of the document):

Map(int lineId, list<int> elements)
{
  for each pair of integers in elements
    emit(pair, 1)
}

Reduce((int, int) pair, list<int> counts)
{
  return sum(counts)
}
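
For concreteness, here is roughly what I imagine it would look like in Java against Hadoop's org.apache.hadoop.mapreduce API (just a sketch; the PairMapper and PairCountReducer class names are made up):

import java.io.IOException;
import java.util.TreeSet;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// PairMapper.java: emits one (pair, 1) record per distinct pair in a line.
public class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text pairKey = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // A sorted set makes each pair appear at most once per line,
    // regardless of how the numbers are ordered or duplicated.
    TreeSet<Integer> numbers = new TreeSet<Integer>();
    for (String token : line.toString().trim().split("\\s+")) {
      if (!token.isEmpty()) {
        numbers.add(Integer.parseInt(token));
      }
    }
    Integer[] sorted = numbers.toArray(new Integer[0]);
    for (int i = 0; i < sorted.length; i++) {
      for (int j = i; j < sorted.length; j++) { // j = i keeps (x, x) pairs, as in the sample output
        pairKey.set(sorted[i] + "\t" + sorted[j]);
        context.write(pairKey, ONE);
      }
    }
  }
}

// PairCountReducer.java: sums the per-line counts for each pair.
public class PairCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    total.set(sum);
    context.write(pair, total);
  }
}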

But this approach would require 2M mappers and 2.5G reducers. Is this a plausible way to go? I am planning to try Hadoop on Azure.


Solution

But this approach would require 2M mappers and 2.5G reducers. Is this a plausible way to go? I am planning to try Hadoop on Azure.

This assumption is not correct.

The number of mappers for FileInputFormat is equal to the number of input splits. An input split can map to a block in HDFS, whose size defaults to 64 MB. So, if the input file is 1024 MB, only 16 map tasks will be launched.
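
As a simplified illustration of that arithmetic (the real split computation also honors the configured minimum and maximum split sizes):

long blockSize = 64L * 1024 * 1024;               // default HDFS block size assumed here
long fileSize  = 1024L * 1024 * 1024;             // a 1 GB input file
long mapTasks  = (fileSize + blockSize - 1) / blockSize;  // -> 16 map tasks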

The number of reducers is configurable through the mapred.reduce.tasks parameter, which defaults to 1. Also note that a combiner can be used to make the job complete faster (see the driver sketch below).
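
As a rough sketch of where those settings live in the newer Java API (this assumes the illustrative PairMapper and PairCountReducer classes sketched in the question; setNumReduceTasks is the programmatic equivalent of mapred.reduce.tasks, and the reducer can double as the combiner because it only sums):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// PairCountJob.java: illustrative driver for the pair-counting job.
public class PairCountJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pair line counts");
    job.setJarByClass(PairCountJob.class);

    job.setMapperClass(PairMapper.class);
    // The reducer only sums, so it is safe to reuse as a combiner and
    // collapse duplicate pair keys on the map side before the shuffle.
    job.setCombinerClass(PairCountReducer.class);
    job.setReducerClass(PairCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Programmatic equivalent of mapred.reduce.tasks: pick a count sized
    // to the cluster, not one reducer per distinct pair.
    job.setNumReduceTasks(32);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}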

I would suggest going through Hadoop: The Definitive Guide for a better understanding of MapReduce and Hadoop.

OTHER TIPS

In short, and I'm no expert, I would do exactly that. @Thomas Jungblut's point is important: a mapper will fire for each block of each file (the block size is configurable, up to a maximum), so you won't have as many mappers as you think. Besides, the point of using a platform like Hadoop is, to some degree, to let it figure that out for you. Your logic is correct.
