Hadoop data split and data flow control

https://stackoverflow.com/questions/11941002

26-06-2021

Question

I have two questions about Hadoop as a storage system.

1) I have a Hadoop cluster with 3 DataNodes, and I want to direct the splits of a large file, say 128 MB in size (assuming a split size of 64 MB), to DataNodes of my choice. That is, how can I control which split goes to which DataNode? For example, with three DataNodes (D1, D2, D3), I want a particular split (say 'A') to be placed on a particular DataNode, say D2. How can we do this?

2) What is the smallest possible split size in the Hadoop filesystem, and how can we configure it to use that smallest size?

Solution

1) You can't control where the data blocks are placed; the NameNode chooses which DataNodes store each block.

2) As small as you want. It should probably be a multiple of 1024 bytes, though I don't think there is an actual constraint on this. On modern hardware, however, anything smaller than 64 MB or 128 MB is inefficient. If the MapReduce job is doing anything CPU-intensive, you can specify a smaller processing split size for it.
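
To make the arithmetic behind the second point concrete, here is a minimal Python sketch (not Hadoop code) of how the block size and the processing split size relate. The function names are hypothetical. The split-size rule is assumed to follow the `max(minSize, min(maxSize, blockSize))` formula used by Hadoop's `FileInputFormat`, and the split count is a simplification that ignores the small slack Hadoop allows on the last split.

```python
MB = 1024 * 1024


def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed to store file_size bytes (ceiling division)."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    return (file_size + block_size - 1) // block_size


def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Processing split size, assuming the FileInputFormat-style rule:
    clamp the block size between the configured minimum and maximum."""
    return max(min_size, min(max_size, block_size))


def split_count(file_size: int, split_size: int) -> int:
    """Approximate number of input splits (simplified: no last-split slack)."""
    return block_count(file_size, split_size)


if __name__ == "__main__":
    file_size = 128 * MB
    block_size = 64 * MB

    # Storage view: the 128 MB file from the question occupies two 64 MB blocks.
    print("blocks:", block_count(file_size, block_size))

    # Processing view: capping the split size at 16 MB (a hypothetical value
    # for a CPU-intensive job) gives more, smaller splits over the same blocks.
    split_size = compute_split_size(block_size, min_size=1, max_size=16 * MB)
    print("split size (MB):", split_size // MB)
    print("splits:", split_count(file_size, split_size))
```

The block size governs how the file is stored, while the split size only governs how much data each map task processes, which is why a job can use splits smaller than the block size without changing how the file is laid out on the DataNodes.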

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow