Can already-partitioned input data improve Hadoop processing?

https://stackoverflow.com/questions/17307734

01-06-2022

Question

I know that during the intermediate steps between mapper and reducer, Hadoop sorts and partitions the data on its way to the reducer.

Since the input to my mappers is already partitioned, is there a way to take advantage of that and speed up the intermediate processing, so that no further sorting or grouping takes place?

Adding some details:

I store my data on S3. Let's say I have only two files in my bucket. The first file stores the records for the lower half of the user IDs, and the second file stores the records for the upper half. The data in each file is not necessarily sorted, but all data pertaining to a given user is guaranteed to be located in the same file.

Such as:

\mybucket\file1
\mybucket\file2

File1 content:
User1,ValueX
User3,ValueY
User1,ValueZ
User1,ValueAZ

File2 content:
User9,ValueD
User7,ValueB
User7,ValueD
User8,ValueB

From what I have read, I can use a streaming job with two mappers, and each mapper will read one of the two files, and read it in its entirety. Is this true?

Next, let's say the mapper outputs each unique key just once, with the associated value being the number of occurrences of that key. (I realize this is more of a reducer's responsibility, but it serves as an example here.)
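For illustration, the mapper behavior described here could be sketched roughly as below. This is a minimal sketch, not a definitive implementation: the function names `count_keys` and `format_output` are made up for this example, and it assumes comma-separated input lines and the tab-separated key/value output that Hadoop streaming expects by default.

```python
from collections import Counter


def count_keys(lines):
    """Count how often each key (the text before the first comma) occurs."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        key, _, _value = line.partition(",")
        counts[key] += 1
    return counts


def format_output(counts):
    """Render one 'key<TAB>count' line per unique key."""
    return [f"{key}\t{count}" for key, count in counts.items()]


# File1 from the question, used here as sample input.
sample = ["User1,ValueX", "User3,ValueY", "User1,ValueZ", "User1,ValueAZ"]
for out_line in format_output(count_keys(sample)):
    print(out_line)

# In an actual streaming mapper, sys.stdin would be passed instead of `sample`.
```

Note that the counts are only complete per mapper: if a file were read by more than one mapper, the same key could be emitted more than once, and a reducer would still have to add the partial counts together.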

Can the sorting and partitioning of those output keys from the mapper be disabled, so that they flow freely to the reducer(s)?

Or, to give another example: imagine all my input data contains just one line for each unique key, and I don't need that data to be sorted in the final output of the reducer. I just want to hash the value for each key. Can I disable the sorting and partitioning step before the reducer?

Solution

Although you'll get 2 mappers for the files shown above, this can't always be guaranteed. The number of mappers depends on the number of InputSplits created from the input data. If your files are big, a single file may be divided into several splits, and you will get more than one mapper per file.
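As a rough illustration of how file size relates to the number of mappers, the sketch below estimates the split count for a single splittable file. It is a deliberate simplification: `estimate_splits` is a hypothetical helper, the 64 MB split size is an assumed value, and the real `FileInputFormat` logic also takes minimum and maximum split size settings into account and allows a little slack on the last split. Files in a non-splittable compression format such as gzip are always read by a single mapper.

```python
import math

MB = 1024 * 1024


def estimate_splits(file_size_bytes, split_size_bytes):
    """Rough number of input splits (and so mappers) for one splittable file."""
    if split_size_bytes <= 0:
        raise ValueError("split size must be positive")
    # Roughly one split per full or partial chunk of split_size_bytes.
    return max(1, math.ceil(file_size_bytes / split_size_bytes))


# Assumed split size of 64 MB; the actual value depends on the job configuration.
split_size = 64 * MB
for size_mb in (10, 64, 200):
    print(size_mb, "MB file ->", estimate_splits(size_mb * MB, split_size), "split(s)")
```

Under these assumptions a small file maps to one split, while a 200 MB file would be read by several mappers, each seeing only part of the file.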

Partitioning is merely a way to decide which key/value pair goes to which reducer. If you disable it, you either need some other way to do this or you'll end up with degraded performance, because the inputs to the reducers will be uneven: one reducer might get all of the input while another gets none. I can't see any performance gain here. Of course, if you think a custom partitioner fits your situation better, you can certainly write one. But skipping partitioning altogether doesn't sound logical to me. The default partitioning behavior is itself based on hashing: after a mapper emits its output, the keys are hashed to determine which reducer each key/value pair goes to.
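To make the two options concrete, here is a small simulation of both. It is only a sketch: Hadoop's default `HashPartitioner` computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, and the code mimics that formula using Java's `String.hashCode()` algorithm, whereas real jobs hash the serialized key type, so the actual reducer assignments will differ. The `range_partition` function is a hypothetical custom rule that assumes keys of the form `User<number>` and a boundary of 5 between the "lower" and "upper" halves.

```python
def java_string_hash(s):
    """Mimic Java's String.hashCode() for ASCII text, kept to 32 bits."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def hash_partition(key, num_reducers):
    """Same formula as the default partitioner: (hash & Integer.MAX_VALUE) % n."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


def range_partition(key, num_reducers, boundary=5):
    """Hypothetical custom rule: user ids below `boundary` go to reducer 0."""
    user_id = int(key[len("User"):])
    return 0 if user_id < boundary else min(1, num_reducers - 1)


# Keys taken from the two example files in the question.
keys = ["User1", "User3", "User7", "User8", "User9"]
for key in keys:
    print(key, "hash ->", hash_partition(key, 2), "range ->", range_partition(key, 2))
```

Either way, every occurrence of a key lands on the same reducer. The custom rule only changes which reducer that is, which matters when the default hash spreads the keys unevenly.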

And if your data is already sorted and you want to skip the sorting phase in your MR job, you might find the patch provided in response to this JIRA useful. The issue is not closed yet, but the patch should help you get started.
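For the second example in the question (one line per unique key, only a hash of the value needed, no ordering required), it may be that no reduce step is needed at all. A job configured with zero reduce tasks (for example by setting `mapreduce.job.reduces` to 0) writes the map output directly to the output location, with no partitioning, sorting, or shuffling. The mapper for such a job could look roughly like the sketch below. The function name `map_line`, the comma-separated input format, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def map_line(line):
    """Turn 'key,value' into 'key<TAB>sha256(value)'; return None for blank lines."""
    line = line.strip()
    if not line:
        return None
    key, _, value = line.partition(",")
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return f"{key}\t{digest}"


# Sample records with one line per unique key, as in the question's second example.
sample = ["User9,ValueD", "User7,ValueB"]
for line in sample:
    out = map_line(line)
    if out is not None:
        print(out)

# In an actual streaming mapper, the loop would iterate over sys.stdin instead.
```

This only applies when each record can be processed on its own. As soon as values for the same key have to be combined, a reduce phase (and with it the shuffle) is needed again.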

HTH

Licensed under: CC-BY-SA with attribution
