Question

I want to get the last document for each station, with all the other fields:

{
        "_id" : ObjectId("535f5d074f075c37fff4cc74"),
        "station" : "OR",
        "t" : 86,
        "dt" : ISODate("2014-04-29T08:02:57.165Z")
}
{
        "_id" : ObjectId("535f5d114f075c37fff4cc75"),
        "station" : "OR",
        "t" : 82,
        "dt" : ISODate("2014-04-29T08:02:57.165Z")
}
{
        "_id" : ObjectId("535f5d364f075c37fff4cc76"),
        "station" : "WA",
        "t" : 79,
        "dt" : ISODate("2014-04-29T08:02:57.165Z")
}

I need to have t and station for the latest dt per station. With the aggregation framework:

db.temperature.aggregate([
    { "$sort": { "dt": 1 } },
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" }, "t": { "$last": "$t" }
    }}
])

returns

{
        "result" : [
                {
                        "_id" : "WA",
                        "result" : ISODate("2014-04-29T08:02:57.165Z"),
                        "t" : 79
                },
                {
                        "_id" : "OR",
                        "result" : ISODate("2014-04-29T08:02:57.165Z"),
                        "t" : 82
                }
        ],
        "ok" : 1
}

Is this the most efficient way to do that?

Thanks


Solution

To directly answer your question: yes, it is the most efficient way. But I do think we need to clarify why this is so.

As was suggested in the alternatives, the one thing people look at is "sorting" your results before passing them to a $group stage, and the sort key they look at is the "timestamp" value; you want to make sure everything is in "timestamp" order, hence the form:

db.temperature.aggregate([
    { "$sort": { "station": 1, "dt": -1 } },
    { "$group": {
        "_id": "$station", 
        "result": { "$first":"$dt"}, "t": {"$first":"$t"} 
    }}
])

And as stated, you will of course want an index that reflects the sort in order to make it efficient. As a minimal sketch (assuming keys and directions mirroring the $sort above):
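// Compound index with the same keys/directions as the $sort stage
db.temperature.ensureIndex({ "station": 1, "dt": -1 })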

However, and this is the real point: what seems to have been overlooked by others (if not by yourself) is that all of this data is likely being inserted already in time order, in that each reading is recorded as it is added.

So the beauty of this is that the _id field (with a default ObjectId) is already in "timestamp" order, as it does itself actually contain a time value, and this makes the following statement possible:

db.temperature.aggregate([
    { "$group": {
        "_id": "$station", 
        "result": { "$last":"$dt"}, "t": {"$last":"$t"} 
    }}
])
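As a quick check that a default ObjectId really does embed its creation time, the shell exposes it directly:

// The leading 4 bytes of a default ObjectId are a Unix timestamp
db.temperature.findOne()._id.getTimestamp()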

And it is faster. Why? Well, you don't need to select an index (additional code to invoke), and you also don't need to "load" the index in addition to the document.

We already know the documents are in order (by _id), so the $last boundaries are perfectly valid. You are scanning everything anyway, and you could equally "range" query on the _id values between two dates.

The only real thing to say here is that in "real world" usage it might just be more practical for you to $match between ranges of dates when doing this sort of accumulation, as opposed to getting the "first" and "last" _id values to define a "range", or something similar in your actual usage.
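If you did want to "range" on _id by date, here is a minimal sketch; objectIdFromDate is a hypothetical helper that relies on a default ObjectId leading with a 4-byte Unix timestamp:

// Hypothetical helper: build an ObjectId boundary from a Date by
// encoding the Unix timestamp in hex and zero-padding the remainder
function objectIdFromDate(date) {
    var seconds = Math.floor(date.getTime() / 1000);
    return ObjectId(seconds.toString(16) + "0000000000000000");
}

db.temperature.aggregate([
    // Range on _id instead of dt, using the embedded timestamps
    { "$match": {
        "_id": {
            "$gte": objectIdFromDate(new Date("2014-04-01")),
            "$lt":  objectIdFromDate(new Date("2014-05-01"))
        }
    }},
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" }, "t": { "$last": "$t" }
    }}
])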

So where is the proof of this? Well, it is fairly easy to reproduce, so I just did so by generating some sample data:

var stations = [ 
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL",
    "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA",
    "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE",
    "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK",
    "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT",
    "VA", "WA", "WV", "WI", "WY"
];


// Insert 200,000 readings with random stations and temperatures;
// dt is taken at insert time, so documents arrive in time order
for ( var i = 0; i < 200000; i++ ) {

    var station = stations[Math.floor(Math.random()*stations.length)];
    var t = Math.floor(Math.random() * ( 96 - 50 + 1 )) + 50;
    var dt = new Date();

    db.temperatures.insert({
        station: station,
        t: t,
        dt: dt
    });

}

On my hardware (an 8GB laptop with a spinning disk, which is not stellar but certainly adequate), running each form of the statement clearly shows a notable pause with the version using an index and a sort (same keys on the index as in the sort statement). It is only a minor pause, but the difference is significant enough to notice.

Even looking at the explain output (version 2.6 and up; it is actually there in 2.4.9, though not documented), you can see the difference: though the $sort is optimized out due to the presence of an index, the time taken appears to go to index selection and then loading the indexed entries. Including all fields for a "covered" index query makes no difference.
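For reference, in the 2.6+ shell you can request the plan by passing an options document (a sketch):

// Ask the 2.6+ shell for the plan instead of the results
db.temperature.aggregate(
    [
        { "$sort": { "station": 1, "dt": -1 } },
        { "$group": {
            "_id": "$station",
            "result": { "$first": "$dt" }, "t": { "$first": "$t" }
        }}
    ],
    { "explain": true }
)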

Also, for the record, purely indexing the date and only sorting on the date values gives the same result. Possibly slightly faster, but still slower than the natural index form without the sort.
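That variant, for completeness, is essentially your original pipeline plus an index on dt:

db.temperature.ensureIndex({ "dt": 1 })
db.temperature.aggregate([
    { "$sort": { "dt": 1 } },
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" }, "t": { "$last": "$t" }
    }}
])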

So as long as you can happily "range" on the first and last _id values, then it is true that using the natural index in insertion order is actually the most efficient way to do this. Your real-world mileage may vary on whether this is practical for you, and it might simply end up being more convenient to implement the index and sorting on the date.

But if you were happy with using _id ranges, or "greater than" the "last" _id, in your query, then here is one tweak in order to get those values along with your results, so you can in fact store and use that information in successive queries:

db.temperature.aggregate([
    // Get documents "greater than" the "highest" _id value found last time
    { "$match": {
        "_id": { "$gt":  ObjectId("536076603e70a99790b7845d") }
    }},

    // Do the grouping with addition of the returned field
    { "$group": {
        "_id": "$station", 
        "result": { "$last":"$dt"},
        "t": {"$last":"$t"},
        "lastDoc": { "$last": "$_id" } 
    }}
])

And if you were actually "following on" from the results like that, then you can determine the maximum value of ObjectId from your results and use it in the next query.
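A minimal sketch of that follow-on, assuming a 2.6+ shell where aggregate returns a cursor:

// Run the grouping and remember the highest _id seen across stations
var docs = db.temperature.aggregate([
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" },
        "t": { "$last": "$t" },
        "lastDoc": { "$last": "$_id" }
    }}
]).toArray();

// Fixed-length hex strings compare lexically in timestamp order
var lastSeen = docs.map(function (d) { return d.lastDoc.str; }).sort().pop();

// ...and seed the $match of the next run with it
db.temperature.aggregate([
    { "$match": { "_id": { "$gt": ObjectId(lastSeen) } } },
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" },
        "t": { "$last": "$t" },
        "lastDoc": { "$last": "$_id" }
    }}
])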

Anyhow, have fun playing with that, but again: yes, in this case that query is the fastest way.

OTHER TIPS

An index is all you really need:

db.temperature.ensureIndex({ 'station': 1, 'dt': 1 })
db.temperature.distinct('station').forEach(function (s) {
    printjson(db.temperature.find({ station: s }).sort({ dt: -1 }).limit(1).next());
});

of course using whatever syntax is actually valid for your language or driver (the above is mongo shell).

Edit: You are correct that a loop like this incurs a round trip per station, which is great for a few stations and not so good for 1,000. You do still want the compound index on station+dt, though, and to take advantage of a descending sort:

db.temperature.aggregate([
    { $sort: { station: 1, dt: -1 } },
    { $group: { _id: "$station", result: {$first:"$dt"}, t: {$first:"$t"} } }
])

As for the aggregation query you've posted, I'd make certain that you have an index on dt:

db.temperature.ensureIndex({'dt': 1 })

This will make certain that the $sort at the beginning of the aggregation pipeline is as efficient as possible.

As to whether or not this is the most efficient way to get this data vs. a query in a loop: that will likely be a function of how many data points you have. In the beginning, with "thousands of stations" and perhaps hundreds of thousands of data points, I'd think the aggregation approach will be faster.

However, as you add more and more data, an issue is that the aggregation query will continue to touch all the documents. This will get increasingly expensive as you scale up to millions or more documents. One approach for that case would be to add a $limit right after the $sort to cap the total number of documents being considered. That's a bit hacky and inexact, but it would help to limit the total number of documents that need to be accessed.
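A sketch of that cap; note the sort is flipped to descending with $first so the limit keeps the most recent documents, and 500000 is an arbitrary assumed window (stations whose latest reading falls outside it will be missing):

db.temperature.aggregate([
    { "$sort": { "dt": -1 } },
    // Only consider the most recent N readings; N is a tuning knob
    { "$limit": 500000 },
    { "$group": {
        "_id": "$station",
        "result": { "$first": "$dt" }, "t": { "$first": "$t" }
    }}
])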

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow