Question

I am doing performance analysis on Hadoop and running some benchmarks. Surprisingly, Grep takes almost 1/10 of the time that WordCount takes to run, which is very counterintuitive. Can anyone explain why this is the case?


Solution

A lot of the work in the map-reduce idiom is the communication between mappers and reducers: every map output record has to be serialized, sorted, and shuffled to a reducer.

In the WordCount example, every word results in an output record (and a reducer input). In the Grep example, every matched pattern results in an output record. If the pattern doesn't match very often, that's not very many records.
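To make the difference in record counts concrete, here is a minimal sketch in plain Python (not Hadoop code). The functions `wordcount_map` and `grep_map` are hypothetical stand-ins for the two mappers, the input lines are made up, and combiners are ignored; it only counts how many records each mapper would emit.

```python
import re

def wordcount_map(line):
    """Emit one (word, 1) record per word, like a WordCount-style mapper."""
    for word in line.split():
        yield (word, 1)

def grep_map(line, pattern):
    """Emit one (match, 1) record per regex match, like a Grep-style mapper."""
    for match in pattern.finditer(line):
        yield (match.group(), 1)

# Hypothetical input: 1000 log-like lines; only every 100th contains "ERROR".
lines = [
    ("ERROR disk failure on node" if i % 100 == 0 else "INFO heartbeat ok from node")
    + f" {i}"
    for i in range(1000)
]
pattern = re.compile(r"ERROR")

# Count the records each mapper would hand to the shuffle phase.
wordcount_records = sum(1 for line in lines for _ in wordcount_map(line))
grep_records = sum(1 for line in lines for _ in grep_map(line, pattern))

print("WordCount map output records:", wordcount_records)
print("Grep map output records:", grep_records)
```

On this made-up input, both mappers read the same 1000 lines, but the WordCount-style mapper emits 6000 records while the Grep-style mapper emits 10. The exact ratio on a real benchmark depends entirely on the data and the pattern.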

I would expect the mappers to run in roughly the same amount of time up to the point where they produce output, since both are I/O bound while reading the input. The CPU difference between the two tasks is negligible. However, a big difference in the amount of output will be highly noticeable, because that output is what has to be written, sorted, and sent to the reducers.

Licensed under: CC-BY-SA with attribution