For MapReduce benchmarks, after I finish running them, can I find out what the input/shuffle/output data sizes are, respectively?

StackOverflow https://stackoverflow.com/questions/23531790

  •  17-07-2023

Question

I read some papers that analyze the input/shuffle/output data sizes of workloads. So my question is: after I finish running the TestDFSIO, Teragen, Terasort, Teravalidate, and Wordcount benchmarks, can I find out what the input/shuffle/output data sizes are, respectively?

For example, if I run:

TestDFSIO,

hadoop jar hadoop-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000

hadoop jar hadoop-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

Terasort,

hadoop jar hadoop-examples.jar teragen 10000000000 <output dir>

hadoop jar hadoop-examples.jar terasort <input dir> <output dir>

hadoop jar hadoop-examples.jar teravalidate <terasort output dir (= input data)> <teravalidate output dir>

What are the input/shuffle/output data sizes for each benchmark?

Thank you!


Solution

Yes, you can. However, since your question is quite broad, I will give examples for TestDFSIO only, which is designed to measure HDFS data transfer performance.
TestDFSIO supports the following arguments: -read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile resultFileName] [-bufferSize Bytes].
Before benchmarking the read operation, you have to write some data, which you do with something like hadoop jar hadoop-test-1.2.1.jar TestDFSIO -write -nrFiles 10 -fileSize 100. Here fileSize is the input size of one file in MB; multiplying it by nrFiles gives 100 MB * 10 = 1000 MB on HDFS. You can find the exact size of the output files under the /benchmarks/TestDFSIO/io_data directory.
You will also see other directories there, such as io_control (which contains the names of the files that were read or written and their sizes).
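The size arithmetic and the HDFS check described above can be sketched as follows. This is only a minimal sketch: it assumes the /benchmarks/TestDFSIO base directory mentioned above and that `hadoop fs -du` is available in your Hadoop version. The variable names are illustrative, not part of TestDFSIO.

```shell
# Values passed to TestDFSIO via -nrFiles and -fileSize (in MB).
nr_files=10
file_size_mb=100

# Expected total data volume: nr_files files of file_size_mb each.
expected_mb=$((nr_files * file_size_mb))
echo "Expected total: ${expected_mb} MB"

# Compare with what is actually stored in HDFS.
# Skipped when the hadoop command is not on PATH.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -du /benchmarks/TestDFSIO/io_data
  hadoop fs -du /benchmarks/TestDFSIO/io_control
fi
```

The sizes reported for io_data should add up to roughly the expected total, while io_control holds only small control files.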
About shuffle: it is an intermediate operation. To find out about it, look at the console output of the MapReduce job, which prints the job counters when the job finishes, or go to the JobTracker's UI.
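One way to pull these numbers out of the console output is to save it to a file and search it for counter names. The sketch below makes several assumptions: the helper `counter_value` and the file name `terasort.log` are hypothetical, and the counter names (HDFS_BYTES_READ, Reduce shuffle bytes, HDFS_BYTES_WRITTEN) are the ones I would expect from a Hadoop 1.x job. They are spelled differently in other versions, so check your own output first.

```shell
# Print the value of a named counter from saved job console output.
# Takes the last line containing "<name>=" and keeps the text after "=".
# usage: counter_value "Reduce shuffle bytes" terasort.log
counter_value() {
  grep -F "$1=" "$2" | tail -n 1 | sed 's/.*=//'
}

# Hypothetical log file, captured beforehand with something like:
#   hadoop jar hadoop-examples.jar terasort <input dir> <output dir> 2>&1 | tee terasort.log
log=terasort.log

if [ -f "$log" ]; then
  # Assumed mapping: input = HDFS bytes read, output = HDFS bytes written.
  for name in "HDFS_BYTES_READ" "Reduce shuffle bytes" "HDFS_BYTES_WRITTEN"; do
    printf '%s = %s bytes\n' "$name" "$(counter_value "$name" "$log")"
  done
fi
```

The `2>&1` matters because Hadoop writes job progress to standard error by default. The same values are shown on the job's page in the JobTracker UI, which is easier when you only need to look at a few jobs.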
The input for the write operation is generated by the TestDFSIO class itself. It is just some bytes calculated by a mod operation, based on the buffer size.
A log file is also generated, which contains I/O and throughput statistics.
Hope this clarifies some of this and gives you a head start. There are lots of other benchmarks that you can explore further.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow