Question

I'm working with a number of large files that contain matrices of data corresponding to NASA's MODIS grid, which splits the Earth's surface into a 21,600 x 43,200 pixel array. This particular dataset gives one integer value per pixel.

I have about 200 files, one file per month, and need to create a time series for each pixel.

My question is: for a map task that takes one of these files, should I cut the grid into chunks of, say, 24,000 pixels and emit those as values (with location and time period as keys), or simply emit a key-value pair for every single pixel, treating each pixel like a word in the canonical word-count example?

The chunking approach will work fine; it just introduces an arbitrary "chunk size" parameter into my program. My feeling is that it will save quite a bit of time on I/O, but that's just a feeling, and I look forward to actual informed opinions!
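For concreteness, here is a minimal sketch of what the chunked mapper could look like in Java with the org.apache.hadoop.mapreduce API. The input layout (one grid row per line, prefixed with its row index) and the configuration key "modis.month" are assumptions for illustration, not something stated in the question; adapt them to the real file format.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Sketch of the "chunked" approach. Assumes each input line holds one grid
 * row, formatted as "<rowIndex> <pixel0> <pixel1> ...", and that the month
 * identifier is passed in via the job configuration under the hypothetical
 * key "modis.month".
 */
public class ChunkMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final int CHUNK_SIZE = 24_000;  // the arbitrary chunk-size knob

    private final Text outKey = new Text();
    private final Text outValue = new Text();
    private String month;

    @Override
    protected void setup(Context context) {
        month = context.getConfiguration().get("modis.month", "unknown");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = line.toString().trim().split("\\s+");
        int rowIndex = Integer.parseInt(tokens[0]);

        // Walk the row in fixed-size chunks; key = location (row + chunk) plus
        // month, value = the pixel values of that chunk joined into one record.
        for (int start = 1; start < tokens.length; start += CHUNK_SIZE) {
            int end = Math.min(start + CHUNK_SIZE, tokens.length);
            int chunkIndex = (start - 1) / CHUNK_SIZE;

            StringBuilder sb = new StringBuilder();
            for (int i = start; i < end; i++) {
                if (i > start) sb.append(' ');
                sb.append(tokens[i]);
            }

            outKey.set(rowIndex + ":" + chunkIndex + ":" + month);
            outValue.set(sb.toString());
            context.write(outKey, outValue);
        }
    }
}
```

A reducer keyed this way receives, for each (row, chunk) location, one value per month and can assemble the per-pixel time series from those ~200 records instead of from billions of single-pixel pairs.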


Solution

From a Hadoop project I worked on, I can confirm that the number of key-value pairs has a direct impact on load, CPU time, and I/O. If you can limit the number of pairs by chunking and still retain enough scalability for your situation, I would certainly try to go that route.
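For a rough back-of-envelope sense of scale, using the figures from the question: each file holds 21,600 x 43,200 ≈ 933 million pixels, so per-pixel emission produces about 933 million key-value pairs per file and roughly 187 billion across 200 files. Chunking at 24,000 pixels per value cuts that to about 38,880 pairs per file, or roughly 7.8 million in total, and that kind of reduction shows up directly in the shuffle and sort cost.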
