Question

I have an Item model which has an attribute category. I want the items count grouped by category. I wrote a map reduce for this functionality. It was working fine. I recently wrote a script to create 5000 items. Now I realize my map reduce only gives the result for the last 80 records. The following is the code for the mapreduce function.

map = %Q{
  function(){
    emit({},{category: this.category});
  }
}

reduce = %Q{
  function(key, values){
    var category_count = {};
    values.forEach(function(value){
      if(category_count.hasOwnProperty(value.category))
        category_count[value.category]++;  
      else
        category_count[value.category] = 1 
    })
    return category_count;
  }
}

Item.map_reduce(map,reduce).out(inline: true).first.try(:[],"value")

After researching a bit, I discovered that MongoDB may invoke the reduce function multiple times. How can I achieve the functionality I intended?


Solution

There is a rule you must follow when writing map-reduce code in MongoDB (a few rules, actually). One is that the emit (which emits key/value pairs) must have the same format for the value that your reduce function will return.

If you emit(this.key, this.value) then reduce must return the exact same type that this.value has. If you emit({},1) then reduce must return a number. If you emit({},{category: this.category}) then reduce must return the document of format {category:"string"} (assuming category is a string).

So that clearly can't be what you want, since you want totals, so let's look at what reduce is returning and work out from that what you should be emitting.
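To see why the mismatch loses counts, here is a minimal sketch in plain JavaScript (no MongoDB needed) that feeds the output of the question's reduce back into itself, which is roughly what happens when MongoDB re-reduces. The sample values are hypothetical and the variable names are only for illustration.

```javascript
// The reduce function from the question, unchanged in behavior.
function originalReduce(key, values) {
  var category_count = {};
  values.forEach(function (value) {
    if (category_count.hasOwnProperty(value.category))
      category_count[value.category]++;
    else
      category_count[value.category] = 1;
  });
  return category_count;
}

// Hypothetical emitted values: three "book" items and one "toy" item.
var emitted = [
  { category: "book" }, { category: "book" },
  { category: "book" }, { category: "toy" }
];

// Reducing everything in one call gives the expected totals.
var singlePass = originalReduce({}, emitted);

// Re-reduce: reduce the first three values, then pass that partial
// result back in together with the remaining value.
var partial = originalReduce({}, emitted.slice(0, 3));
var rereduced = originalReduce({}, [partial, emitted[3]]);

console.log(singlePass); // { book: 3, toy: 1 }
console.log(rereduced);  // { undefined: 1, toy: 1 }
```

The partial result has no category field, so the second call files it under "undefined" and the three "book" items disappear. Only the values seen in the final call are counted correctly.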

It looks like at the end you want to accumulate a document where there is a keyname for each category and its value is a number representing the number of its occurrences. Something like:

{category_name1:total, category_name2:total}

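If that's the case, then the correct map function would emit a document whose single key is the value of this.category and whose value is 1, for example {books: 1}. (Writing the literal {"this.category": 1} would not work, because it creates a key literally named "this.category".) Your reduce will then need to add up the numbers for each key corresponding to a category.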

Here is what the map should look like:

map=function (){
     var category = {};
     category[this.category]=1;
     emit({},category);
}

And here is the correct corresponding reduce:

reduce=function (key,values) {
     var category_count = {};
     values.forEach(function(value){
        for (var cat in value) {
           if( !category_count.hasOwnProperty(cat) ) category_count[cat]=0;
           category_count[cat] += value[cat];
        }
     });
     return category_count;
}

Note that it satisfies two other requirements for MapReduce - it works correctly if the reduce function is never called (which will be the case if there is only one document in your collection) and it will work correctly if the reduce function gets called multiple times (which is what's happening when you have more than 100 documents).
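Both properties can be checked outside MongoDB. The sketch below is a plain JavaScript simulation, not MongoDB's actual scheduling: reduceInBatches is a hypothetical helper that reduces the emitted values in batches and feeds each partial result back in, and the 250 sample values are made up.

```javascript
// The corrected reduce: sums the per-category numbers in each value.
function reduce(key, values) {
  var category_count = {};
  values.forEach(function (value) {
    for (var cat in value) {
      if (!category_count.hasOwnProperty(cat)) category_count[cat] = 0;
      category_count[cat] += value[cat];
    }
  });
  return category_count;
}

// Hypothetical helper: reduce `values` in batches of `batchSize`,
// passing each partial result into the next batch (a re-reduce).
// A lone value is returned as-is, mirroring "reduce is never called".
function reduceInBatches(values, batchSize) {
  var acc = null;
  for (var i = 0; i < values.length; i += batchSize) {
    var batch = values.slice(i, i + batchSize);
    if (acc !== null) batch.unshift(acc);
    acc = batch.length === 1 ? batch[0] : reduce({}, batch);
  }
  return acc;
}

// 250 hypothetical emitted values cycling through three categories,
// in the same shape the corrected map emits: { <category>: 1 }.
var names = ["book", "toy", "game"];
var emitted = [];
for (var n = 0; n < 250; n++) {
  var doc = {};
  doc[names[n % 3]] = 1;
  emitted.push(doc);
}

var once = reduceInBatches(emitted, 250);    // a single reduce call
var batched = reduceInBatches(emitted, 100); // three calls with re-reduce
console.log(once);
console.log(batched);
```

Because reduce returns the same shape that map emits, both runs produce the same totals regardless of how the values are batched.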

A more conventional way to do that would be to emit category name as key and the number as value. This simplifies map and reduce:

map=function() { 
   emit(this.category, 1);
}

reduce=function(key,values) {
    var count=0;
    values.forEach(function(val) {
        count+=val;
    });
    return count;
}

This will sum the number of times each category appears. This also satisfies requirements for MapReduce - it works correctly if the reduce function is never called (which will be the case for any category that only appears once) and it will work correctly if the reduce function gets called multiple times (which will happen if any category appears more than 100 times).
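As a rough illustration of the output shape, here is a small in-memory simulation of the key/value version in plain JavaScript. The emit function, the groups object, and the sample documents are stand-ins written for this sketch. They are not MongoDB's implementation, which supplies emit itself.

```javascript
// Stand-in for MongoDB's emit: collect values under their key.
var groups = {};
function emit(key, value) {
  if (!groups.hasOwnProperty(key)) groups[key] = [];
  groups[key].push(value);
}

// Same map and reduce as above.
function map() {
  emit(this.category, 1);
}

function reduce(key, values) {
  var count = 0;
  values.forEach(function (val) {
    count += val;
  });
  return count;
}

// Hypothetical documents.
var docs = [
  { category: "book" },
  { category: "toy" },
  { category: "book" }
];

// Call map with each document bound to `this`, as MongoDB does.
docs.forEach(function (doc) {
  map.call(doc);
});

// Reduce each key; a key with a single value skips reduce entirely.
var results = Object.keys(groups).map(function (key) {
  var values = groups[key];
  return { _id: key, value: values.length === 1 ? values[0] : reduce(key, values) };
});

console.log(results);
// [ { _id: 'book', value: 2 }, { _id: 'toy', value: 1 } ]
```

Each result carries the category in _id and its count in value, which is the per-key layout map-reduce output uses.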

As others pointed out, the aggregation framework makes the same exercise much simpler:

db.collection.aggregate({$group:{_id:"$category",count:{$sum:1}}})

Note that this matches the format of the second mapReduce I showed, not the original format you had, which outputs category names as keys. However, the aggregation framework is generally significantly faster than MapReduce.
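If the original layout (category names as keys) is still wanted, the grouped rows can be folded into a single object on the client. This is a minimal sketch that assumes the aggregation results have already been collected into an array. How you obtain that array depends on your shell or driver version, and the rows shown are hypothetical.

```javascript
// Hypothetical rows, in the shape produced by the $group stage above.
var rows = [
  { _id: "book", count: 3 },
  { _id: "toy", count: 1 }
];

// Fold the rows into one object keyed by category name.
var byCategory = {};
rows.forEach(function (row) {
  byCategory[row._id] = row.count;
});

console.log(byCategory); // { book: 3, toy: 1 }
```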

OTHER TIPS

I agree with Neil Lunn's comment.

From the information provided, if you are on MongoDB 2.2 or later, you can use the aggregation framework instead of map-reduce:

db.items.aggregate([
  { $group: { _id: '$category', category_count: { $sum: 1 } } }
])

This is a lot simpler and more performant (see Map/Reduce vs. Aggregation Framework).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow