Question

Mongo docs state:

The Mongo multikey feature can automatically index arrays of values.

That's nice. But how about sorting based on multikeys? More specifically, how to sort a collection according to array match percentage?

For example, I have a pattern [ 'fruit', 'citrus' ] and a collection that looks like this:

{
    title: 'Apples',
    tags: [ 'fruit' ]
},

{
    title: 'Oranges',
    tags: [ 'fruit', 'citrus' ]
},

{
    title: 'Potato',
    tags: [ 'vegetable' ]
}

Now, I want to sort the collection according to match percentage of each entry to the tags pattern. Oranges must come first, apples second and potatoes last.

What's the most efficient and easy way to do it?

Solution

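As of MongoDB 2.1, a similar computation can be done using the aggregation framework. The syntax is something like

db.fruits.aggregate([
     // Keep only documents with at least one matching tag
     {$match : {tags : {$in : ["fruit", "citrus"]}}},
     {$unwind : "$tags"},
     // Drop unwound tags that are not in the pattern, so only matches are counted
     {$match : {tags : {$in : ["fruit", "citrus"]}}},
     {$group : {_id : "$title", numTagMatches : {$sum : 1}}},
     {$sort : {numTagMatches : -1}}
])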

which returns

 {
   "_id" : "Oranges",
   "numTagMatches" : 2
 },
 {
   "_id" : "Apples",
   "numTagMatches" : 1
 }

This should be much faster than the map-reduce method for two reasons. First, the implementation is native C++ rather than JavaScript. Second, "$match" filters out the items that don't match at all. If that is not what you want, you can leave out the "$match" filtering and change the "$sum" part to add either 1 or 0, depending on whether the unwound tag is "fruit" or "citrus" or neither.
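The variant without "$match" could look roughly like the sketch below. It is an illustration and not a tested production pipeline: the collection and field names follow the example above, while the `pattern` and `pipeline` variables are hypothetical names introduced here. It assumes the "$cond", "$or" and "$eq" aggregation expressions are available in your server version.

```javascript
// Sketch only: "pattern" and "pipeline" are hypothetical names for this example.
const pattern = ["fruit", "citrus"];

const pipeline = [
  // One document per tag. Documents with an empty tags array are dropped here.
  { $unwind: "$tags" },
  {
    $group: {
      _id: "$title",
      numTagMatches: {
        // Add 1 when the unwound tag equals any pattern entry, otherwise 0.
        $sum: {
          $cond: [
            { $or: pattern.map(function (tag) { return { $eq: ["$tags", tag] }; }) },
            1,
            0
          ]
        }
      }
    }
  },
  { $sort: { numTagMatches: -1 } }
];

// In the mongo shell, run it with: db.fruits.aggregate(pipeline)
```

With the sample collection from the question, Potato would be expected to show up as well, with numTagMatches equal to 0. Note that without the leading "$match" the pipeline has to unwind every document in the collection.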

The only caveat here is that MongoDB 2.1 isn't recommended for production yet. If you're running in production, you'll need to wait for 2.2. But if you're just experimenting on your own, you can play around with 2.1, as the aggregation framework should be more performant.

OTHER TIPS

Note: The following explanation is required for Mongo 2.0 and earlier. For later versions you should consider the new aggregation framework.

We do something similar when fuzzy-matching input sentences that we index. You can use map reduce to count the matches for each object ID and then sum them up. You'll then need to load the results into your client and sort by the highest value first.

db.plants.mapReduce(
    // map: count how many of the target terms appear in this document's tags
    function () {
        var matches = 0;
        for (var i = 0; i < targetTerms.length; i++) {
            var term = targetTerms[i];
            for (var j = 0; j < this.tags.length; j++) {
                matches += Number(term === this.tags[j]);
            }
        }
        emit(this._id, matches);
    },

    // reduce: sum all counts emitted for the same key
    function (key, values) {
        var result = 0;
        for (var i = 0; i < values.length; i++) {
            result += values[i];
        }
        return result;
    },

    {
        out: { inline: 1 },

        // scope makes targetTerms visible inside the map function
        scope: {
            targetTerms: [ 'fruit', 'citrus' ]
        }
    }
);

You pass your [ 'fruit', 'citrus' ] input values through the scope parameter of the map reduce call, as { targetTerms: [ 'fruit', 'citrus' ] }, so that they are available in the map function above.
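The client-side sorting step mentioned above could be sketched as follows. This is only an illustration: the `output` object is hypothetical sample data shaped like an inline map reduce result (a `results` array of `{ _id, value }` entries), the ids are made up, and the `percent` field is an addition for this example that divides the match count by the number of target terms.

```javascript
// Hypothetical sample of the inline map reduce output: each entry carries
// the document _id and the summed match count in "value".
const targetTerms = ["fruit", "citrus"];
const output = {
  results: [
    { _id: "apples-id", value: 1 },
    { _id: "potato-id", value: 0 },
    { _id: "oranges-id", value: 2 }
  ]
};

// Copy the results, sort by match count (highest first), and attach the
// share of target terms that each document matched.
const ranked = output.results
  .slice()
  .sort(function (a, b) { return b.value - a.value; })
  .map(function (r) {
    return {
      _id: r._id,
      matches: r.value,
      percent: 100 * r.value / targetTerms.length
    };
  });

console.log(ranked);
```

Sorting a copy leaves the original result untouched, which can help if the raw counts are needed elsewhere.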

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow