Problem

I am running a series of MapReduce jobs on EMR. The 3rd MapReduce job needs the output of the 2nd MapReduce job, which is essentially over a million key-value pairs (both the key and the value are less than 1 KB). Is there a good way to store this information in a distributed store on the same cluster as the EMR jobs so that subsequent jobs can access it? I looked at DistributedCache, but it seems to be meant for storing files, and I am not sure Hadoop is optimized for storing a million tiny files.

Or maybe I can somehow use another MapReduce job to combine all of the key-value pairs into ONE output file, and then put that entire file into DistributedCache.

Please advise. Thanks!

Solution

Usually, the output of a MapReduce job is stored in HDFS (or S3). The number of reducers of that job determines the number of output files. Why would you end up with a million tiny files? Are you running a million reducers? That seems unlikely.

So if you define a single reducer for your 2nd job, you'll automatically end up with a single output file, which will be stored in HDFS, and your 3rd job can read and process that file as input. If the 2nd job needs multiple reducers, you'll simply have multiple output files. One million key-value pairs, with the key and the value each under 1 KB, give you a file of less than 2 GB. With an HDFS block size of 64 MB, that output is split into blocks of 64 MB each, which allows the 3rd job to process the blocks in parallel (multiple mappers).
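A minimal driver sketch of that chaining, assuming the newer `org.apache.hadoop.mapreduce` API; the job names, paths, and output types are placeholders, and you would plug in your own mapper and reducer classes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobChainDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical intermediate directory; on EMR this could also be an s3:// path.
        Path secondOutput = new Path(args[1]);

        // 2nd job: forcing a single reducer produces exactly one output file
        // (part-r-00000) under secondOutput in HDFS.
        Job second = Job.getInstance(conf, "second job");
        second.setJarByClass(JobChainDriver.class);
        // second.setMapperClass(...); second.setReducerClass(...);  // your own classes go here
        second.setOutputKeyClass(Text.class);        // adjust to your actual output types
        second.setOutputValueClass(Text.class);
        second.setNumReduceTasks(1);                 // <- the single reducer
        FileInputFormat.addInputPath(second, new Path(args[0]));
        FileOutputFormat.setOutputPath(second, secondOutput);
        if (!second.waitForCompletion(true)) {
            System.exit(1);
        }

        // 3rd job: simply point its input at the 2nd job's output directory.
        Job third = Job.getInstance(conf, "third job");
        third.setJarByClass(JobChainDriver.class);
        // third.setMapperClass(...); third.setReducerClass(...);    // your own classes go here
        third.setOutputKeyClass(Text.class);
        third.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(third, secondOutput);
        FileOutputFormat.setOutputPath(third, new Path(args[2]));
        System.exit(third.waitForCompletion(true) ? 0 : 1);
    }
}
```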

You should use DistributedCache only if the whole file needs to be read by every single mapper. With a file of up to 2 GB, however, that is a rather flawed approach.
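For completeness, here is a sketch of what that DistributedCache route could look like, assuming the `Job.addCacheFile` API and the default symlink behavior on YARN; the path and field names are hypothetical, and keep in mind the caveat above about file size:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    // Driver side: register the 2nd job's single output file with the cache
    // (hypothetical path).
    public static void addCache(Job job) throws Exception {
        job.addCacheFile(new URI("/intermediate/job2-output/part-r-00000"));
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Every mapper reads the whole cached file into memory -- fine for a small
        // file, but a flawed approach once the file approaches 2 GB.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            // The cached file is symlinked into the task working directory by name.
            String localName = new Path(cacheFiles[0].getPath()).getName();
            try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2); // default TextOutputFormat separator
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String looked = lookup.get(value.toString());
        if (looked != null) {
            context.write(value, new Text(looked));
        }
    }
}
```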
