Question

I'm looking for the optimal way to do aggregation, and I'm not sure how I need to work with indexing while aggregating. If someone has experience with that, could you share your ideas or experience?

Situation:

  • MongoDB collection with millions of records, let's say some logs (around 3-5 million per day)
  • The implementation is done with Java 7 and the Mongo aggregation framework
  • Log record in Mongo collection looks like this:
     {
          "_id": "",
          "timestamp": "",
          "userId": "",
          "userIp": "",
          "country": "",
          "city": "",
          "applicationName": ""
     }
  • I have different reports based on the log data. I need to create reports for almost every field and field combination; moreover, all aggregation should be done Daily/Weekly/Monthly

Question: How should I work with indexing? And in your opinion, what is the best way to create reports from such data?


Solution

So for index deployment to optimize this, you want the following indexes created, or otherwise specified with the equivalent @CompoundIndexes annotation on your class:

db.collection.createIndex({
    "timestamp": 1, "userId": 1
})

db.collection.createIndex({
    "timestamp": 1, "applicationName": 1, "country": 1
})

That comes from your comments about the intended usage, so two indexes are required in total.
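The reason two indexes cover both report shapes is MongoDB's compound-index "prefix" rule: a compound index can serve a query only when the queried fields form a left-to-right prefix of the index keys. A toy sketch of that rule (the `canUseIndex` helper is hypothetical, purely for illustration, not part of any driver API):

```javascript
// Toy illustration of the compound-index prefix rule: a query can use
// a compound index only if its fields are a left-to-right prefix of
// the index key pattern.
function canUseIndex(indexKeys, queriedFields) {
    const queried = new Set(queriedFields);
    let matched = 0;
    for (const key of indexKeys) {
        if (queried.has(key)) matched += 1;
        else break; // stop at the first index key the query does not use
    }
    return matched === queried.size && matched > 0;
}

const idx1 = ["timestamp", "userId"];
const idx2 = ["timestamp", "applicationName", "country"];

console.log(canUseIndex(idx1, ["timestamp"]));                    // true: prefix
console.log(canUseIndex(idx1, ["timestamp", "userId"]));          // true: full match
console.log(canUseIndex(idx2, ["timestamp", "applicationName"])); // true: prefix
console.log(canUseIndex(idx2, ["country"]));                      // false: not a prefix
```

This is why "timestamp" leads both indexes: every report filters on the date range first, so both indexes share that prefix.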

Also note that you want your "timestamp" values to be BSON Dates; that way you get the date aggregation operators, which are important to your actual queries. The shell JavaScript form is used here for general reference:

db.collection.aggregate([
    // Using the index that was created
    { "$match": {
        "timestamp": { 
           "$gte": new Date("2014-04-01"), "$lt": new Date("2014-05-01")
        },
        "userId": { "$gte": "lowervalue", "$lte: "uppervalue" }
    }},

    // Grouping Data
    { "$group": {
        "_id": {
            "y": { "$year": "$timestamp" },
            "m": { "$month": "$timestamp" },
            "d": { "$day": "$timestamp" }
        },
        "someField": { "$sum": "$someField" },
        "otherField": { "$avg": "$otherField" }
    }}
])

So it is the "date aggregation operators" that allow you to split that BSON Date into the components you want (in this case down to the day), so that every document whose timestamp falls within those boundaries ends up in the same group, and the other aggregation operations are applied to the remaining fields.
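If it helps to see what those operators compute, here is a plain Node.js sketch of the daily grouping key, assuming UTC timestamps; `$year`, `$month` and `$dayOfMonth` correspond to the UTC accessors below. For weekly reports you would use `$week` instead of the day component, and for monthly reports simply drop it:

```javascript
// Plain JavaScript mirror of the $group _id used in the pipeline,
// assuming timestamps are stored in UTC.
function groupKeyDaily(timestamp) {
    return {
        y: timestamp.getUTCFullYear(),   // what $year extracts
        m: timestamp.getUTCMonth() + 1,  // what $month extracts (1-12)
        d: timestamp.getUTCDate()        // what $dayOfMonth extracts
    };
}

const ts = new Date("2014-04-15T13:45:00Z");
console.log(groupKeyDaily(ts)); // { y: 2014, m: 4, d: 15 }
```

Every log written on 2014-04-15, whatever its time of day, produces the same key and therefore lands in the same daily bucket.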

Please note that indexes can only ever be used in the initial $match stage of the aggregation pipeline, so this is importantly where you select your data and reduce your working set. If you do things this way, you will get the maximum performance possible from your data.

For further gains, consider "pre-aggregating" information in other collections, based on periodically running the base forms of aggregation over the raw "log" data that you have.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow