Question

I am using a Mapper to load a large amount of data; each record has an execution time and a large query associated with it. I only need to find the 1000 most expensive queries, so I emit the execution time as the key from my mapper. I use a single reducer, and I want only 1000 records to be written, after which the reducer should stop processing.

I can keep a global counter and do this: if (count < 1000) { context.write(key, value); }

But this will still load all of the billions of records and then simply not write them.

I want the reducer to stop after emitting 1000 records, thereby avoiding the seek time and read time for the remaining records.

Is this possible?
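
For reference, the setup described above looks roughly like this (a minimal sketch, not the actual code: the class names and the tab-separated record layout are assumptions, and LongWritable.DecreasingComparator is used so the most expensive queries sort first):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExpensiveQueries {
  public static class QueryMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed record layout: "<executionTimeMillis>\t<queryText>"
      String[] parts = line.toString().split("\t", 2);
      context.write(new LongWritable(Long.parseLong(parts[0])), new Text(parts[1]));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "top expensive queries");
    job.setJarByClass(ExpensiveQueries.class);
    job.setMapperClass(QueryMapper.class);
    job.setNumReduceTasks(1);  // a single reducer sees every record
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    // Sort keys in descending order so the most expensive queries arrive first
    job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}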


The solution

You can short-circuit your reducer entirely by overriding the default implementation of the Reducer.run() method:

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKey()) {
    reduce(context.getCurrentKey(), context.getValues(), context);
  }
  cleanup(context);
}

You should be able to amend the while loop to include your counter as follows:

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  int count = 0;
  // count++ compares before incrementing, so exactly 1000 keys are reduced
  while (context.nextKey() && count++ < 1000) {
    reduce(context.getCurrentKey(), context.getValues(), context);
  }
  cleanup(context);
}

Note that this won't necessarily output the topmost records, merely the first 1000 keyed records (and it will not work as-is if your reduce implementation outputs more than a single record per key, in which case you can increment the counter in the reduce method instead, as in the sketch below).
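
A minimal sketch of that multi-record variant (the class name Top1000Reducer and the LongWritable/Text types are assumptions to make the example self-contained):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Top1000Reducer
    extends Reducer<LongWritable, Text, LongWritable, Text> {
  private int count = 0;

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    // Stop pulling keys once 1000 records have been written
    while (context.nextKey() && count < 1000) {
      reduce(context.getCurrentKey(), context.getValues(), context);
    }
    cleanup(context);
  }

  @Override
  protected void reduce(LongWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      if (count >= 1000) {
        return;  // the cap applies mid-key as well
      }
      context.write(key, value);
      count++;
    }
  }
}

Wire it in with job.setReducerClass(Top1000Reducer.class). Also bear in mind that only the reducer's iteration is cut short: the map and shuffle phases still process every record.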

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow