Yes, you can. However, since your question is quite broad, I will give examples for TestDFSIO only, which is designed to measure HDFS data transfer performance.
TestDFSIO supports the following arguments:
-read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile resultFileName] [-bufferSize Bytes]
Now, before benchmarking the read operation, you have to write some data first, which you do with something like:
hadoop jar hadoop-test-1.2.1.jar TestDFSIO -write -nrFiles 10 -fileSize 100
Here fileSize is the size of one file in MB, and multiplying it by nrFiles gives the total amount of data on HDFS: 100 MB * 10 = 1000 MB. You can find the exact size of the output files under the /benchmarks/TestDFSIO/io_data directory.
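As a rough sketch of that arithmetic, the snippet below computes the expected total and, if a hadoop client happens to be on the PATH, lists what is actually stored. The total_mb helper is a name made up for this illustration, and the path assumes the default TestDFSIO data directory mentioned above.

```shell
#!/bin/sh
# total_mb is a hypothetical helper, not part of Hadoop:
# expected data volume in MB = number of files * size of each file in MB.
total_mb() {
  echo $(( $1 * $2 ))
}

echo "Expected total: $(total_mb 10 100) MB"

# Only query HDFS when a hadoop client is available.
# Assumes the default TestDFSIO data directory.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -du /benchmarks/TestDFSIO/io_data || echo "could not list io_data"
fi
```

Note that the listing reports bytes, so the numbers will not match the MB figure digit for digit.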
You will also see some other directories there, such as io_control (which contains the names and sizes of the files that were read or written).
About shuffle: it is an intermediate operation. To find out about it, just look at the console output printed while the MapReduce job was running, or go to the JobTracker's UI to see it.
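If you save the console output of a job to a file, the counter lines can be pulled out with a small filter. This is only a sketch: show_counter is a hypothetical helper, it assumes the job client prints counters as name=value lines, and the exact counter names (for example "Reduce shuffle bytes") may differ between Hadoop versions.

```shell
#!/bin/sh
# show_counter is a hypothetical helper, not a Hadoop command.
# Usage: show_counter <saved-console-log> <counter name>
# Assumes counters were printed as "<name>=<value>", one per line.
show_counter() {
  grep -F "$2=" "$1" | sed 's/.*=//'
}
```

For example, if the job output was captured with something like 2>&1 | tee job.log, then show_counter job.log "Reduce shuffle bytes" would print the value, provided that counter name appears in your version's output.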
The input for the write operation is generated by the TestDFSIO class itself. It is just some bytes calculated with a mod operation, based on bufferSize.
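To give a feel for what such generated data looks like, here is a small sketch that fills a buffer of n bytes with a repeating pattern. The exact formula is an assumption on my part (byte i is the character '0' plus i mod 50, which is how I recall the Hadoop 1.x source), and fill_buffer is a made-up name, so check the TestDFSIO source of your version before relying on it.

```shell
#!/bin/sh
# fill_buffer is a hypothetical illustration, not Hadoop code.
# Assumed pattern: byte i = '0' (ASCII 48) + (i mod 50).
fill_buffer() {
  n=$1
  i=0
  while [ "$i" -lt "$n" ]; do
    # Convert the byte value to an octal escape and print that character.
    printf "\\$(printf '%03o' $(( 48 + i % 50 )))"
    i=$(( i + 1 ))
  done
}
```

The point is only that the content is synthetic and cheap to produce, so generating it does not dominate the I/O being measured.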
You also get a log file, which contains I/O and throughput statistics.
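A single statistic can be read out of that file with a short filter. This is a sketch under assumptions: dfsio_field is a hypothetical helper, each statistic is assumed to sit on its own line as "label: value", and the file name is whatever you passed to -resFile (as far as I know the default is TestDFSIO_results.log in the local working directory).

```shell
#!/bin/sh
# dfsio_field is a hypothetical helper, not part of Hadoop.
# Usage: dfsio_field <results-file> <field label>
# Assumes lines of the form "<label>: <value>"; if the file holds
# several runs, the value from the last matching line is printed.
dfsio_field() {
  awk -v label="$2" '
    {
      pos = index($0, label ":")
      if (pos > 0) {
        v = substr($0, pos + length(label) + 1)
        sub(/^ +/, "", v)
        last = v
      }
    }
    END { if (last != "") print last }
  ' "$1"
}
```

For instance, dfsio_field TestDFSIO_results.log "Total MBytes processed" would print the amount of data the last run handled, assuming that label is present in your file.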
Hope this clarifies some of this and gives you a head start. There are lots of benchmarks which you can explore further.
For MapReduce benchmarks, when I finish running them, am I able to know what the input/shuffle/output data sizes are, respectively?
17-07-2023
Question
I read some papers about analyzing the input/shuffle/output data sizes of workloads. So my question is: after I finish running the TestDFSIO, Teragen, Terasort, Teravalidate, and Wordcount benchmarks, can I find out what the input/shuffle/output data sizes are, respectively?
For example, if I run:
TestDFSIO,
hadoop jar hadoop-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
Terasort,
hadoop jar hadoop-examples.jar teragen 10000000000 output dir
hadoop jar hadoop-examples.jar terasort input dir output dir
hadoop jar hadoop-examples.jar teravalidate terasort output dir (= input data) teravalidate output dir
What are the input/shuffle/output data sizes for each benchmark?
Thank you!
Solution