Question

I have a collection full of documents with a created_date attribute. I'd like to send these documents through an aggregation pipeline to do some work on them. Ideally, I would filter them with a $match before doing any other work so that I can take advantage of indexes, but I can't figure out how to use the new $year/$month/$dayOfMonth operators in my $match expression.

There are a few examples floating around of how to use the operators in a $project operation, but I'm concerned that by placing a $project as the first step in my pipeline I lose access to my indexes (the MongoDB documentation indicates that the first expression must be a $match to take advantage of indexes).

Sample data:

{
    post_body: 'This is the body of test post 1',
    created_date: ISODate('2012-09-29T05:23:41Z'),
    comments: 48
}
{
    post_body: 'This is the body of test post 2',
    created_date: ISODate('2012-09-24T12:34:13Z'),
    comments: 10
}
{
    post_body: 'This is the body of test post 3',
    created_date: ISODate('2012-08-16T12:34:13Z'),
    comments: 10
}

I'd like to run this through an aggregation pipeline to get the total comments on all posts made in September:

{
    aggregate: 'posts',
    pipeline: [
         {$match:
             /*Can I use the $year/$month operators here to match Sept 2012?
             $year:created_date : 2012,
             $month:created_date : 9
             */
             /*or does this have to be 
             created_date : 
                  {$gte:{$date:'2012-09-01T04:00:00Z'}, 
                  $lt: {$date:'2012-10-01T04:00:00Z'} }
             */
         },
         {$group:
             {_id: '0',
              totalComments:{$sum:'$comments'}
             }
          }
    ]
 }

This works but the match loses access to any indexes for more complicated queries:

{
    aggregate: 'posts',
    pipeline: [
         {$project:
              {
                   month : {$month:'$created_date'},
                   year : {$year:'$created_date'}
              }
         },
         {$match:
              {
                   month:9,
                   year: 2012
               }
         },
         {$group:
             {_id: '0',
              totalComments:{$sum:'$comments'}
             }
          }
    ]
 }

Solution

As you already found, you cannot $match on fields that are not in the document (it works exactly the same way that find works), and if you use $project first you lose the ability to use indexes.

What you can do instead is combine your efforts as follows:

{
    aggregate: 'posts',
    pipeline: [
         {$match: {
             created_date : 
                  {$gte: {$date:'2012-09-01T04:00:00Z'}, 
                   $lt:  {$date:'2012-10-01T04:00:00Z'} 
                  }
             }
         },
         {$group:
             {_id: '0',
              totalComments:{$sum:'$comments'}
             }
          }
    ]
 }
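
If you'd rather not hand-write the two boundary dates, here is a minimal sketch in plain JavaScript that builds the same kind of half-open range for any month. The monthRange helper is hypothetical (it is not part of MongoDB or any driver), and it assumes UTC midnight boundaries rather than the 04:00Z offset used above, so adjust it if your dates are meant to be in a local timezone.

```javascript
// Hypothetical helper: build a half-open [start, end) range covering one
// calendar month, with boundaries at UTC midnight (an assumption).
function monthRange(year, month) {
  // month is 1-based (9 = September); Date.UTC expects a 0-based month
  const start = new Date(Date.UTC(year, month - 1, 1));
  // passing the 1-based month as a 0-based one yields the following month,
  // and Date.UTC rolls December over into January of the next year
  const end = new Date(Date.UTC(year, month, 1));
  return { $gte: start, $lt: end };
}

// The filter only touches the stored created_date field, so an index on
// that field can still be used by the leading $match.
const septemberPipeline = [
  { $match: { created_date: monthRange(2012, 9) } },
  { $group: { _id: null, totalComments: { $sum: "$comments" } } },
];
```

Using $gte on the first day and $lt on the first day of the next month avoids having to know how many days the month has.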

The above only gives you the aggregate for September. If you want to aggregate over multiple months, you can, for example, match the whole date range first and then group by year and month:

{
    aggregate: 'posts',
    pipeline: [
         {$match: {
             created_date : 
                  { $gte: {$date:'2012-07-01T04:00:00Z'}, 
                    $lt:  {$date:'2012-10-01T04:00:00Z'}
                  }
             }
         },
         {$project: {
              comments: 1,
              new_created: {
                        "yr" : {"$year" : "$created_date"},
                        "mo" : {"$month" : "$created_date"}
                     }
              }
         },
         {$group:
             {_id: "$new_created",
              totalComments:{$sum:'$comments'}
             }
          }
    ]
 }

and you'll get back something like:

{
    "result" : [
        {
            "_id" : {
                "yr" : 2012,
                "mo" : 7
            },
            "totalComments" : 5
        },
        {
            "_id" : {
                "yr" : 2012,
                "mo" : 8
            },
            "totalComments" : 19
        },
        {
            "_id" : {
                "yr" : 2012,
                "mo" : 9
            },
            "totalComments" : 21
        }
    ],
    "ok" : 1
}
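
To make it clearer what the $match, $project and $group stages do together, here is a rough in-memory sketch in plain JavaScript run against the three sample posts from the question. The totalsByMonth function is a hypothetical stand-in, not MongoDB code, and it assumes the year and month are taken in UTC (which, as far as I know, is what $year and $month do when no timezone is given). The totals differ from the output above because that output came from a different data set.

```javascript
// The three sample posts from the question (post_body omitted).
const samplePosts = [
  { created_date: new Date("2012-09-29T05:23:41Z"), comments: 48 },
  { created_date: new Date("2012-09-24T12:34:13Z"), comments: 10 },
  { created_date: new Date("2012-08-16T12:34:13Z"), comments: 10 },
];

// Hypothetical in-memory equivalent of the $match -> $project -> $group pipeline.
function totalsByMonth(docs, start, end) {
  const totals = new Map();
  for (const doc of docs) {
    // $match: keep only dates in the half-open range [start, end)
    if (doc.created_date < start || doc.created_date >= end) continue;
    // $project: derive year and month (UTC, month is 1-based like $month)
    const yr = doc.created_date.getUTCFullYear();
    const mo = doc.created_date.getUTCMonth() + 1;
    // $group: one bucket per (yr, mo) pair, summing comments
    const key = `${yr}-${mo}`;
    const bucket = totals.get(key) || { _id: { yr, mo }, totalComments: 0 };
    bucket.totalComments += doc.comments;
    totals.set(key, bucket);
  }
  return [...totals.values()];
}

const monthlyTotals = totalsByMonth(
  samplePosts,
  new Date("2012-07-01T04:00:00Z"),
  new Date("2012-10-01T04:00:00Z")
);
```

With the sample data this produces two buckets: September 2012 with 58 comments and August 2012 with 10.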

OTHER TIPS

Let's look at building some pipelines that involve operations that are already familiar to us. We're going to look at the following stages:

  • match - this is a filtering stage, similar to find.
  • project
  • sort
  • skip
  • limit

We might ask ourselves why these stages are necessary, given that this functionality is already provided by the MongoDB query language. The reason is that we need these stages to support the more complex, analytics-oriented functionality that's included with the aggregation framework. The query below is simply equivalent to a find:


db.companies.aggregate([{
  $match: {
    founded_year: 2004
  }
}, ])

Let's introduce a project stage in this aggregation pipeline:


db.companies.aggregate([{
  $match: {
    founded_year: 2004
  }
}, {
  $project: {
    _id: 0,
    name: 1,
    founded_year: 1
  }
}])

We use the aggregate method to run the aggregation framework. An aggregation pipeline is merely an array of documents, and each document must specify one particular stage operator. So, in the above case we have an aggregation pipeline with two stages. The $match stage passes the documents one at a time to the $project stage.
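
Since a pipeline is just an array of stage documents, you can build and inspect it as ordinary data before handing it to aggregate. The following is only a sketch in plain JavaScript; the stageNames helper is hypothetical and simply checks the "one stage operator per document" rule described above.

```javascript
// The two-stage pipeline from above, held in a variable as plain data.
const companiesPipeline = [
  { $match: { founded_year: 2004 } },
  { $project: { _id: 0, name: 1, founded_year: 1 } },
];

// Hypothetical helper: return the stage operators in order, and throw if a
// stage document does not contain exactly one $-prefixed key.
function stageNames(stages) {
  return stages.map((stage, index) => {
    const keys = Object.keys(stage);
    if (keys.length !== 1 || !keys[0].startsWith("$")) {
      throw new Error(`stage ${index} must contain exactly one stage operator`);
    }
    return keys[0];
  });
}

const stageOrder = stageNames(companiesPipeline);
```

Here stageOrder holds the operators in execution order, which is handy when checking that $match really does come first.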

Let's extend to limit stage:


db.companies.aggregate([{
  $match: {
    founded_year: 2004
  }
}, {
  $limit: 5
}, {
  $project: {
    _id: 0,
    name: 1
  }
}])

This gets the matching documents and limits them to five before projecting out the fields, so the projection works on only 5 documents. Suppose we did something like this instead:


db.companies.aggregate([{
  $match: {
    founded_year: 2004
  }
}, {
  $project: {
    _id: 0,
    name: 1
  }
}, {
  $limit: 5
}])

This gets the matching documents, projects all of them, and only then limits the result to five, so the projection runs over a potentially large number of documents. The lesson is that we should limit the documents passed to the next stage to those that are absolutely necessary. Now, let's look at the sort stage:


db.companies.aggregate([{
  $match: {
    founded_year: 2004
  }
}, {
  $sort: {
    name: 1
  }
}, {
  $limit: 5
}, {
  $project: {
    _id: 0,
    name: 1
  }
}])

This sorts all the matching documents by name and returns only the first 5 of them. Suppose we did something like this instead:


db.companies.aggregate([{
  $match: {
    founded_year: 2004
  }
}, {
  $limit: 5
}, {
  $sort: {
    name: 1
  }
}, {
  $project: {
    _id: 0,
    name: 1
  }
}])

This takes the first 5 matching documents and then sorts only those. Let's add the skip stage:


db.companies.aggregate([{
  $match: {
    founded_year: 2004
  }
}, {
  $sort: {
    name: 1
  }
}, {
  $skip: 10
}, {
  $limit: 5
}, {
  $project: {
    _id: 0,
    name: 1
  }
}, ])

This sorts all the matching documents, skips the first 10, and returns the next 5. We should try to include $match stages as early as possible in the pipeline. To filter documents using a $match stage, we use the same syntax for constructing query documents (filters) as we do for find().
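
The effect of stage order can be seen without a database. The sketch below uses plain JavaScript arrays to mimic $sort, $skip and $limit; the names are made-up placeholders standing in for documents that already passed $match, and the array methods are only an analogy for what the stages do.

```javascript
// Made-up documents standing in for the output of the $match stage,
// in the order they happen to arrive.
const matched = ["delta", "alpha", "echo", "charlie", "bravo", "golf", "foxtrot"]
  .map((name) => ({ name }));

const byName = (a, b) => a.name.localeCompare(b.name);
const namesOf = (docs) => docs.map((doc) => doc.name);

// $sort then $limit: the first 3 names of the whole sorted set
const sortThenLimit = namesOf([...matched].sort(byName).slice(0, 3));

// $limit then $sort: whichever 3 documents arrive first, then sorted
const limitThenSort = namesOf(matched.slice(0, 3).sort(byName));

// $sort, $skip 2, $limit 3: one "page" out of the sorted set
const page = namesOf([...matched].sort(byName).slice(2, 2 + 3));
```

Sorting first gives alpha, bravo, charlie, while limiting first gives alpha, delta, echo, so the two orders are not interchangeable. The page comes out as charlie, delta, echo.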

Try this:

db.createCollection("so");
db.so.remove({}); // empty filter: clear out any existing documents
db.so.insert([
{
    post_body: 'This is the body of test post 1',
    created_date: ISODate('2012-09-29T05:23:41Z'),
    comments: 48
},
{
    post_body: 'This is the body of test post 2',
    created_date: ISODate('2012-09-24T12:34:13Z'),
    comments: 10
},
{
    post_body: 'This is the body of test post 3',
    created_date: ISODate('2012-08-16T12:34:13Z'),
    comments: 10
}
]);
//db.so.find();

db.so.ensureIndex({"created_date":1});
db.runCommand({
    aggregate:"so",
    pipeline:[
        {
            $match: { // filter only those posts in september
                created_date: { $gte: ISODate('2012-09-01'), $lt: ISODate('2012-10-01') }
            }
        },
        {
            $group: {
                _id: null, // no shared key
                comments: { $sum: "$comments" } // total comments for all the posts in the pipeline
            }
        },
]
//,explain:true
});

The result is:

{ "result" : [ { "_id" : null, "comments" : 58 } ], "ok" : 1 }

So you could also modify your previous example to do this, although I'm not sure why you'd want to unless you plan on doing something else with month and year in the pipeline:

{
    aggregate: 'posts',
    pipeline: [
     {$match: { created_date: { $gte: ISODate('2012-09-01'), $lt: ISODate('2012-10-01') } } },
     {$project:
          {
               comments: 1, // keep comments so the $group stage can sum it
               month : {$month:'$created_date'},
               year : {$year:'$created_date'}
          }
     },
     {$match:
          {
               month:9,
               year: 2012
           }
     },
     {$group:
         {_id: '0',
          totalComments:{$sum:'$comments'}
         }
      }
    ]
 }
Licensed under: CC-BY-SA with attribution