Question

With default Hadoop settings, how long would it take to write a 64 MB file into HDFS, assuming it takes 4 minutes to write a block?

According to me, with the default block size of 64 MB, the client has to write a single block, which should take 4 * 3 [replication factor] = 12 minutes.

Reason
HDFS uses pipelining to achieve its replicated writes. When the client receives the list of DataNodes from the NameNode, the client streams the block data to the first DataNode (4 minutes), which in turn mirrors the data to the next DataNode (4 minutes), and so on until the data has reached all of the DataNodes (4 minutes again). Acknowledgements from the DataNodes are also pipelined in reverse order.
4 + 4 + 4 = 12 minutes

Can someone confirm whether my understanding is correct?


Solution

Your understanding is along the right path, but not fully correct. Here is an excerpt from the Definitive Guide book:

It’s possible, but unlikely, that multiple datanodes fail while a block is being written. As long as dfs.replication.min replicas (which default to one) are written, the write will succeed, and the block will be asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to three).

This suggests that the time taken by the -put command depends on the dfs.replication.min setting found in hdfs-default.xml.
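For reference, the property could be overridden in hdfs-site.xml along these lines (an illustrative fragment; note that in newer Hadoop versions this key has been renamed to dfs.namenode.replication.min, and the value shown is the default):

```xml
<configuration>
  <!-- Minimum number of replicas that must be written
       before a block write is considered successful. -->
  <property>
    <name>dfs.replication.min</name>
    <value>1</value>
  </property>
</configuration>
```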

By default this is 1, so based on your example it should take around 4 minutes to complete,

as -put will wait for an ack from only one DataNode.
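The timing model described above can be sketched in a few lines of Python (an illustration only; the function name and the simplifying assumption that each synchronously acknowledged replica costs a full block write are mine, not Hadoop's):

```python
# Toy model of HDFS write completion time -- a sketch, not real HDFS code.
# Assumption: -put returns once dfs.replication.min replicas have acknowledged
# the block; any remaining replicas are created asynchronously afterwards.

def put_time_minutes(block_write_minutes: int, min_replicas: int) -> int:
    """Time until -put returns, if each synchronous replica costs a full block write."""
    return block_write_minutes * min_replicas

# Default dfs.replication.min = 1: the 64 MB file is one block at 4 min/block.
print(put_time_minutes(4, 1))  # -> 4, matching the answer above

# The question's "three sequential copies" model corresponds to min_replicas = 3.
print(put_time_minutes(4, 3))  # -> 12
```

In practice the remaining replication to reach dfs.replication happens in the background, so the client-observed time tracks only the synchronous part of the pipeline.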

Other tips

I agree with Sudarshan's answer; just to add on the timing aspect:
Say you have dfs.replication.min set to 2 and dfs.replication set to 3. Then a put operation wrapped in a simple time command gives you the time taken for 2 replications,
e.g. time hadoop fs -put filename destDir.
At that point, you can already read your file.

However, your file will still be under-replicated, and the NameNode will try to replicate it to a third DataNode. Since this operation is asynchronous, you never know how long it may take. But whenever it gets done, the NameNode will register it as a normal block, which you can see with fsck.

I tried to put a 6 GB file into HDFS using hadoop fs -put filename destDir. When it completed, I used df to check hard disk usage, and the Used column did not grow any more. I think -put waits for acks until all 3 replicas are complete.

To verify this, I tried putting the same file into HDFS with 3 replicas three times; it took 43s, 35s, and 40s. Then I set dfs.replication to 1, and it took 7s, 5s, and 6s.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow