Question

I've been working on my graduation project, whose goal is to do image search on Hadoop. We're using the OpenCV library for image processing. So far we've got a prototype working, but its efficiency doesn't meet our expectations.

Now we want to make sure each map task is assigned to the slave node where its data is located (i.e. a task assigned to a node should process only the chunks stored on that exact node). As I understand it, that means I need to know the whereabouts of each chunk of my data, plus some other chunk-level information (e.g. which chunk is assigned to which slave node, etc.). I've found some of this on the HTTP administration interface (the one that uses port 50030 by default), but it's not enough, and gathering the information I need there is time-consuming. So, is there any way to see such information? Any log files or an API?

And if we're not satisfied with the way the Hadoop scheduler assigns our tasks, is there a way to influence how each individual chunk is assigned, or how Hadoop splits its inputs? I know it would be a nightmare to manually do all the work the scheduler normally does, but I'd like to keep this as a last resort.

To make a long story short:

  1. Can I get any chunk-level information from Hadoop through logs or an API? For example, how many chunks there are, the locations of those chunks, etc. The information on the JobTracker's HTTP interface is not enough.
  2. Is there any way to influence job assignment, chunk assignment, and the way Hadoop splits its inputs?

Thanks in advance.


Solution

  1. You can get the block/chunk information through code or via the command line. See How to check the distributed data over hdfs for the command-line route; a programmatic sketch follows this list.

  2. Yes, there probably is. You can override the InputFormat/RecordReader pair to change how inputs are split, though you may not be able to do exactly what you want easily; see the second sketch below.
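
For point 1 programmatically, here is a minimal sketch using the HDFS Java client's `FileSystem.getFileBlockLocations()`, which reports each block's offset, length, and the datanodes holding its replicas. The class name and input path are placeholders. On the command line, `hadoop fsck <path> -files -blocks -locations` prints the same information.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative utility: prints one line per HDFS block of a file,
// including which datanodes hold that block's replicas.
public class BlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path(args[0]);
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block in the byte range [0, file length)
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        System.out.println(file + ": " + blocks.length + " block(s)");
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}
```

Run it with something like `hadoop jar yourjob.jar BlockReport /user/you/input/file` (all names illustrative).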
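
For point 2, here is a minimal sketch of the InputFormat/RecordReader route, assuming the newer `org.apache.hadoop.mapreduce` API. It follows the standard "whole file as one record" pattern: the format refuses to split files, so each map task processes exactly one whole file, and the scheduler still uses the split's block locations as locality hints. The class names and the key/value choice (file path → raw bytes) are illustrative, not standard Hadoop classes.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative InputFormat: each file becomes exactly one split, so one
// map task handles one whole image and the scheduler can place the task
// on a node holding that file's blocks.
public class WholeImageInputFormat
        extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split an image across map tasks
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeImageRecordReader();
    }

    // Reads an entire file as a single (file path, raw bytes) record.
    public static class WholeImageRecordReader
            extends RecordReader<Text, BytesWritable> {

        private FileSplit split;
        private Configuration conf;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path file = split.getPath();
            byte[] contents = new byte[(int) split.getLength()];
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = fs.open(file);
            try {
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(file.toString());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}
```

Set it on the job with `job.setInputFormatClass(WholeImageInputFormat.class)`. If you need finer control over placement, you can also override `getSplits()` and build `FileSplit`s with host lists of your choosing, but note the scheduler treats those hosts as preferences, not guarantees.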
