Problem

I am prepping for an exam and here is a question in the lecture notes:

Why are the outputs of map tasks written to the local disk and not to HDFS?

Here are my thoughts:

  • It reduces network traffic: the reducer may run on the same machine as the map output, so no copying is required.
  • The map output doesn't need HDFS's fault tolerance. If the job dies halfway through, we can simply re-run the map task.
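The second bullet can be sketched as a simple retry loop: because a map task is a deterministic function of its input split, a failed attempt can be re-executed from scratch with nothing to recover (a minimal Python sketch; `run_map_attempt` and the failure model are hypothetical, not Hadoop code):

```python
def run_map_attempt(split, attempt, fail_first=True):
    """Hypothetical map attempt: fails on the first try, succeeds on retry."""
    if fail_first and attempt == 0:
        raise RuntimeError("task attempt failed")
    # A map task is deterministic: re-running it on the same split
    # reproduces the same intermediate output, so the output itself
    # needs no HDFS-style durability.
    return [(word, 1) for word in split.split()]

def run_with_retries(split, max_attempts=4):
    for attempt in range(max_attempts):
        try:
            return run_map_attempt(split, attempt)
        except RuntimeError:
            continue  # the scheduler simply launches a fresh attempt
    raise RuntimeError("too many failed attempts")

print(run_with_retries("to be or not to be"))
```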

What are other possible reasons? Are my answers reasonable?


Solution

Your reasoning is correct. However, I'd like to add a few points: consider what would happen if map outputs were written to HDFS.
Writing to HDFS is not like writing to local disk. It is a more involved process: the namenode ensures that at least dfs.replication.min replicas are written, and it also runs a background thread that makes additional copies of under-replicated blocks.
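For reference, that minimum-replica requirement is a configurable property in hdfs-site.xml (shown with the property name the answer uses, as in older Hadoop releases; newer releases rename it dfs.namenode.replication.min):

```xml
<!-- hdfs-site.xml: a write is only acknowledged once at least this
     many replicas have been written (sketch; the default is 1) -->
<property>
  <name>dfs.replication.min</name>
  <value>1</value>
</property>
```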
Suppose the user kills the job partway through, or the job simply fails. Lots of intermediate files would be left sitting on HDFS for no reason, and you would have to delete them manually. If this happens often enough, your cluster's performance will degrade: HDFS is optimized for appending, not for frequent deletion.
Also, if the job fails during the map phase, it performs a cleanup before exiting. If the intermediate output were on HDFS, that cleanup would require the namenode to send block-deletion messages to the appropriate datanodes, invalidating each block and removing it from the blocksMap. That is a lot of work for a failed cleanup, and for no gain!
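The deletion path described above can be sketched as a toy model, just to show the per-replica fan-out of a cleanup (illustrative names only; this is not the real NameNode code):

```python
# Toy model of the extra bookkeeping a failed-job cleanup would need
# if intermediate files lived on HDFS (hypothetical data, not Hadoop's
# actual structures).
blocks_map = {
    "blk_1": {"file": "/tmp/intermediate/part-0", "datanodes": ["dn1", "dn2"]},
    "blk_2": {"file": "/tmp/intermediate/part-1", "datanodes": ["dn2", "dn3"]},
}
pending_deletions = []  # delete messages the namenode would queue for datanodes

def cleanup_intermediate(prefix):
    doomed = [b for b, info in blocks_map.items() if info["file"].startswith(prefix)]
    for blk in doomed:
        info = blocks_map.pop(blk)  # invalidate: remove from the blocksMap
        for dn in info["datanodes"]:
            pending_deletions.append((dn, blk))  # one message per replica

cleanup_intermediate("/tmp/intermediate/")
print(len(pending_deletions))  # every replica of every block needs a message
```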

Other tips

Because it doesn’t use valuable cluster bandwidth. This is called the data locality optimization. Sometimes, however, all the nodes hosting the HDFS block replicas for a map task’s input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.
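The fallback order in that quote (node-local, then rack-local, then off-rack) can be sketched as a small helper (hypothetical function, not the actual Hadoop scheduler code):

```python
def pick_node(replica_nodes, free_slots, rack_of):
    """Prefer a free node holding a replica, then any free node in the
    same rack as a replica, then any off-rack node (sketch)."""
    # 1. data-local: a free slot on a node that holds the block
    for node in replica_nodes:
        if node in free_slots:
            return node, "node-local"
    # 2. rack-local: a free slot in the same rack as a replica
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in free_slots:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. off-rack: any free slot, paying an inter-rack transfer
    return next(iter(free_slots)), "off-rack"

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
print(pick_node(["n1"], {"n2", "n3"}, rack_of))  # ('n2', 'rack-local')
```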

from "Hadoop: The Definitive Guide", 4th edition

One more point about writing the map output to the local file system: the outputs of all the mappers are eventually merged and become the input for the shuffle and sort stages that precede the reduce phase.
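That merge step can be sketched with Python's heapq.merge: each mapper's local spill is already sorted by key, and the merged, key-grouped stream is the shape of input a reducer consumes (a minimal sketch of the idea, not Hadoop's actual merge code):

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Each mapper's output partition is already sorted by key (local disk spills).
mapper_0 = [("apple", 1), ("banana", 1), ("banana", 1)]
mapper_1 = [("apple", 1), ("cherry", 1)]

# Merge the sorted runs, then group by key -- the reduce phase sees
# each key together with all of its values.
merged = heapq.merge(mapper_0, mapper_1, key=itemgetter(0))
reduced = {key: sum(v for _, v in group)
           for key, group in groupby(merged, key=itemgetter(0))}
print(reduced)  # {'apple': 2, 'banana': 2, 'cherry': 1}
```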

License: CC BY-SA with attribution