I am using a Mapper to load a large amount of data, where each record has an execution time and a large query associated with it. I just need to find the 1000 most expensive queries, so I emit the execution time as the key from my mapper. I use a single reducer, and I want only 1000 records to be written and the reducer to then stop processing.

I can have a global counter and do this: if (count < 1000) { context.write(key, value) }
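For example, a minimal sketch of what I mean (the types are just what I have in mind - a LongWritable execution time as the key and the query text as the value):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopQueriesReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
  private int count = 0; // records written so far

  @Override
  protected void reduce(LongWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      if (count < 1000) { // only the first 1000 records get written
        context.write(key, value);
        count++;
      }
    }
  }
}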

But this will still read through all the billions of records and simply not write them.

I want the reducer to stop after emitting 1000 records, thereby avoiding the seek time and read time for the next set of records.

Is this possible?

Solution

You can short-circuit your reducer by overriding the default implementation of the Reducer.run() method:

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKey()) {
    reduce(context.getCurrentKey(), context.getValues(), context);
  }
  cleanup(context);
}

You should be able to amend the while loop to include your counter as follows:

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  int count = 0;
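  // stop pulling keys once the first 1000 have been consumed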
  while (context.nextKey() && count++ < 1000) {
    reduce(context.getCurrentKey(), context.getValues(), context);
  }
  cleanup(context);
}

Note that this won't necessarily output the topmost records, merely the first 1000 keyed records (keys are sorted in ascending order by default, so to see the most expensive queries first you would also need a descending sort comparator). It also won't work if your reduce implementation outputs more than a single record per key - in which case you can increment the counter in the reduce method instead.
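A minimal sketch of that variant, under the same assumed types as above (a LongWritable execution-time key and Text query value; the class name and the limit are illustrative), counting records written rather than keys consumed:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopQueriesReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
  private static final int LIMIT = 1000;
  private int written = 0; // records written so far, shared by run() and reduce()

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    // stop pulling keys once the output limit has been reached
    while (context.nextKey() && written < LIMIT) {
      reduce(context.getCurrentKey(), context.getValues(), context);
    }
    cleanup(context);
  }

  @Override
  protected void reduce(LongWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      if (written >= LIMIT) {
        return; // limit reached mid-key; skip the remaining values
      }
      context.write(key, value);
      written++;
    }
  }
}

As for the ascending-order issue: with a LongWritable key, one option is to set job.setSortComparatorClass(LongWritable.DecreasingComparator.class) on the job, so the reducer sees the most expensive queries first.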
