I'm writing a Hadoop application, but it seems I have misunderstood how Hadoop actually works. My input files are map tiles, named according to the QuadTile principle. I need to subsample those and stitch them together until I reach a certain higher-level tile, which covers a larger area at a lower resolution, like zooming out in Google Maps.
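To illustrate the naming scheme I'm relying on (this is a plain-Java sketch for context; `QuadTileNaming` and `children` are hypothetical helpers I made up here, not part of my actual code): a parent tile's key plus one of the region letters A–D gives the key of one of its four children.

```java
// Hypothetical illustration of the QuadTile naming used below:
// a parent tile "12" is covered by four children "12A".."12D",
// each at twice the resolution. The region letters A-D correspond
// to the four quadrants my reducer draws into.
class QuadTileNaming {

    // Children are formed by appending one region letter to the parent key.
    static String[] children(String parent) {
        return new String[] { parent + "A", parent + "B", parent + "C", parent + "D" };
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", children("12")));
        // 12A, 12B, 12C, 12D
    }
}
```

Conversely, dropping the last character of a child key yields the parent key, which is exactly what my mapper does.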
One of the things I have done is write a mapper that executes on every (unsplittable) tile, like this:
public void map(Text keyT, ImageWritable value, Context context)
        throws IOException, InterruptedException {
    String key = keyT.toString();
    // Check whether this tile belongs to the area being processed
    if (key.startsWith(context.getJobName())) {
        // Drop the last character: four sibling tiles collapse to one parent key
        String newKey = key.substring(0, key.length() - 1);
        ImageWritable iw = subSample(value);
        // Remember which quadrant of the parent this tile covers
        char region = key.charAt(key.length() - 1);
        iw.setRegion(region);
        context.write(new Text(newKey), iw);
    } else {
        // Tile not needed in this calculation
    }
}
My reducer looks like this:
public void reduce(Text key, Iterable<ImageWritable> values, Context context)
        throws IOException, InterruptedException {
    ImageWritable higherLevelTile = new ImageWritable();
    for (ImageWritable s : values) {
        int width = s.getWidth();
        int height = s.getHeight();
        // The region character decides which quadrant of the parent
        // tile this subsampled child is drawn into
        char c = Character.toUpperCase(s.getRegion());
        int basex = 0, basey = 0;
        if (c == 'A') {            // top-left
            basex = basey = 0;
        } else if (c == 'B') {     // top-right
            basex = width;
            basey = 0;
        } else if (c == 'C') {     // bottom-left
            basex = 0;
            basey = height;
        } else {                   // 'D': bottom-right
            basex = width;
            basey = height;
        }
        BufferedImage toDraw = s.getBufferedImage();
        Graphics g = higherLevelTile.getBufferedImage().getGraphics();
        g.drawImage(toDraw, basex, basey, null);
    }
    context.write(key, higherLevelTile);
}
As you can perhaps derive from my code, I expected Hadoop to execute in the following way:
1) Map all tiles of level one
2) Do a first reduce. Here I expected the Iterable values to contain four elements: the four subsampled tiles of the lower level.
3) Map all tiles currently in the context
4) Reduce all tiles in the context. Again, Iterable values would contain four elements...
5) ... repeat ...
6) When no more maps are left -> write the output
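In other words, I pictured the grouping behaving like this plain-Java simulation (no Hadoop involved; `ExpectedFlow`, `parentKey` and `round` are stand-ins I invented for illustration, with Strings in place of ImageWritable tiles): each round groups tiles by their parent key, and each reduce call was expected to see exactly the four children.

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java sketch of the flow I expected; one "round" stands for
// one map + reduce pass, and Strings stand in for ImageWritable tiles.
class ExpectedFlow {

    // "Map": strip the region character to get the parent key.
    static String parentKey(String key) {
        return key.substring(0, key.length() - 1);
    }

    // One expected round: group tiles by parent key. Each group of four
    // children is what I expected a single reduce call to receive.
    static Map<String, List<String>> round(Collection<String> tileKeys) {
        return tileKeys.stream()
                .collect(Collectors.groupingBy(ExpectedFlow::parentKey));
    }

    public static void main(String[] args) {
        List<String> level1 = Arrays.asList(
                "00A", "00B", "00C", "00D",   // four children of tile "00"
                "01A", "01B", "01C", "01D");  // four children of tile "01"

        Map<String, List<String>> reduced = round(level1);
        System.out.println(reduced.get("00")); // [00A, 00B, 00C, 00D]

        // ...and the next round would collapse "00" and "01" towards "0".
        System.out.println(round(reduced.keySet()).keySet()); // [0]
    }
}
```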
Turns out, that is not correct. My reducer is called after every map, and the Iterable never seems to contain more than one element. I tried to fix that by altering the reducer code slightly, assuming the Iterable would contain two elements: one subsampled value and one partially finished higher-level tile. That turns out not to be correct either.
Can anyone tell me, or point me towards, how the flow of Hadoop actually works? What should I do to make my use case work? I hope I have explained it clearly.