Question

I'm using mongodb, and have a model that adds comments as embedded documents.

How do I get the average age of comments for an entry? (relative example, my fields vary a little)

So I can have many comments for an entry, and I need to find out the average age of a comment, or the average :cal_date. Additional metrics would be great to gather like the max :cal_date for all entries/comments or per entry...

Does this make sense? Need more detail? I'm happy to oblige to get the solution. I've been confused with date calculations for a while now.

Another way to think of this is using the library book model: There are many books and each book has many checkouts/ins. I need to find the average time that each book is checked out and the average time that all books are out. Again, just metrics, but the fact that these are all dates is confusing.

{
  _id: ObjectId("51b0d94c3f72fb89c9000014"),
  barcode: "H-131887",
  comments: [
    {
      _id: ObjectId("51b0d94c3f72fb89c9000015"),
      cal_date: ISODate("2013-07-03T16:04:57.893Z"),
      cal_date_due: ISODate("2013-07-03T16:04:57.894Z")
    },
    {
      _id: ObjectId("51b0e6053f72fbb27900001b"),
      cal_date: ISODate("2012-07-03T19:39:43.074Z"),
      cal_date_due: ISODate("2013-07-03T19:39:43.076Z"),
      updated_at: ISODate("2013-06-06T19:41:57.770Z"),
      created_at: ISODate("2013-06-06T19:41:57.770Z")
    }
  ],
  created_at: ISODate("2013-06-06T18:47:40.481Z"),
  creator_id: ObjectId("5170547c791e4b1a16000001"),
  description: "",
  maker: "MITUTOYO",
  model: "2046S",
  serial: "QEL228",
  status: "Out",
  updated_at: ISODate("2013-06-07T18:54:38.340Z")
}

One more thing: how do I include additional fields in my output using $push? I can get this to work, but it includes, say, barcode twice in an array: "barcode" => ["H-131887", "H-131887"]

Solution

You didn't say what time units you want the age in, but I'm just going to show you how to get it back in minutes and trust you can work out how to convert that to any other time grain. I'm going to assume original documents have schema like this:

{ _id: xxx,
  post_id: uniqueId,
  comments: [ { ..., date: ISODate() }, ..., { ... , date: ISODate() } ],
  ...
}

Now the aggregation:

// first define a fixed point in time that you are calculating age from.
// it must be a Date (new Date()-1 would be a plain number, and $subtract
// cannot subtract a date from a number), so use the current time as a Date
var now = new Date();
// unwind the comments array so you can work with individual comments
var unwind = {$unwind:"$comments"};
// calculate a new comment_age value
var project = {$project: {
       post_id:1, 
       comment_age: {
           $divide:[ 
               {$subtract:[now, "$comments.date"]},
               60000
           ]
       }
} };
// group back by post_id calculating average age of comments
var group = {$group: {
               _id: "$post_id",
               age: {$avg: "$comment_age"}
            } };
// now do the aggregation:

db.coll.aggregate( [ unwind, project, group ] )

You can use $max, $min, and other grouping functions to find the oldest and newest comment dates or the lowest/highest comment age. You can group by post_id for per-post results, or group by a constant to compute the same values across the entire collection.
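
As a rough sketch of the "group by a constant" idea, the pipeline below collects the newest date, the oldest date, and the average age in minutes over all comments in the collection. It assumes the same hypothetical schema as above (a `comments` array whose elements have a `date` field); the names `referenceTime`, `collectionStats`, `comment_date`, `newest`, `oldest`, and `avgAgeMinutes` are made up for illustration.

```javascript
// Reference point for the age calculation. It has to be a Date so that
// $subtract returns the difference in milliseconds.
const referenceTime = new Date();

const collectionStats = [
  // One output document per comment.
  { $unwind: "$comments" },
  // Keep the raw date and compute the age in minutes (60000 ms per minute).
  { $project: {
      comment_date: "$comments.date",
      comment_age: {
        $divide: [ { $subtract: [ referenceTime, "$comments.date" ] }, 60000 ]
      }
  } },
  // A constant _id (null here) puts every comment into a single group.
  { $group: {
      _id: null,
      newest: { $max: "$comment_date" },
      oldest: { $min: "$comment_date" },
      avgAgeMinutes: { $avg: "$comment_age" }
  } }
];

// In the mongo shell: db.coll.aggregate(collectionStats)
```

Changing `_id: null` to `_id: "$post_id"` (and carrying `post_id` through the $project stage) should give the same three values per post instead.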

* edit * Using the document you included for the "library book" example, this might be the pipeline to calculate, for each book that's currently "Out", how long it has been out. It assumes that "comments.cal_date" is when the book was checked out and that the latest cal_date of all the comments represents the current check-out (the older ones having been returned):

 db.coll.aggregate( [
    { $match  : { status : "Out"  } },
    { $unwind : "$comments" },
    { $group  : { _id : "$_id", 
                  cal_date : { $max : "$comments.cal_date" } 
                } 
    },
    { $project : { outDuration : { $divide : [ 
                                     { $subtract : [ 
                                                     ISODate("2013-07-15"), 
                                                     "$cal_date" 
                                                   ] 
                                     },
                                     24*60*60*1000 
                                    ] 
                                  }
                  } 
    },
    { $group : { _id : 1, 
                 avgOut : { $avg : "$outDuration" } 
               } 
    } 
 ] )

What the steps are doing:

  • $match filters documents by status so that the calculation covers only books that are currently Out.
  • $unwind flattens the "comments" array so that each comment can be processed on its own.
  • $group with $max finds the latest cal_date for each book.
  • $project subtracts this max cal_date (which represents when the book was checked out) from the reference date (2013-07-15 here, standing in for today's date) and divides the result by the number of milliseconds in a day to get the number of days the book has been out.
  • the final $group combines all the results to find the average number of days all the checked-out books have been out.
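
To check the date arithmetic without a database, the same steps can be mimicked in plain JavaScript. This is only a sketch of what the pipeline computes, not how the server executes it; the `books` array is a trimmed copy of the sample document from the question, and `asOf`, `outDurations`, and `avgOut` are illustrative names.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Trimmed copy of the sample document from the question.
const books = [
  {
    _id: "51b0d94c3f72fb89c9000014",
    status: "Out",
    comments: [
      { cal_date: new Date("2013-07-03T16:04:57.893Z") },
      { cal_date: new Date("2012-07-03T19:39:43.074Z") }
    ]
  }
];

// Same fixed reference date as the pipeline (midnight UTC).
const asOf = new Date("2013-07-15T00:00:00Z");

const outDurations = books
  .filter(book => book.status === "Out")                 // $match
  .map(book => {
    // $unwind + $group/$max: latest cal_date per book, in milliseconds.
    const latest = Math.max(...book.comments.map(c => c.cal_date.getTime()));
    // $project: difference in milliseconds converted to days.
    return (asOf.getTime() - latest) / MS_PER_DAY;
  });

// Final $group/$avg over all checked-out books.
const avgOut = outDurations.reduce((sum, d) => sum + d, 0) / outDurations.length;
console.log(avgOut.toFixed(2)); // prints "11.33"
```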

* edit * I was assuming you knew Ruby and just needed to know how to write an aggregation framework command to calculate date differences/averages/etc. Here is the same code in Ruby, using "now" as the date to compare cal_date against (you can also do it using a constant date value):

# get db collection from MongoClient into variable 'coll'
# see basic MongoDB Ruby driver tutorial for details
coll.aggregate([ 
   { "$match"  => {"status"=>"Out"} }, 
   { "$unwind" => "$comments"}, 
   { "$group"  => { "_id" => "$_id", "cal_date" => { "$max" => "$comments.cal_date" } } },
   { "$project"=> {
                    "outDuration" => { 
                       "$divide" => [ 
                            {"$subtract" => [ Time.now, "$cal_date" ] }, 
                            24*60*60*1000
                       ]
                    }
                  }
   },
   { "$group"  => {
          "_id"    => 1,
          "avgOut" => {"$avg"=>"$outDuration"}
     }
   }  
])

See https://github.com/mongodb/mongo-ruby-driver/wiki/Aggregation-Framework-Examples for more examples and explanations.

If there are additional fields that you want to preserve in your $group phase you can add more fields by changing the pipeline step like this:

    { $group  : { _id : "$_id", 
                  barcode  : { $first : "$barcode" },
                  cal_date : { $max : "$comments.cal_date" } 
                } 
    } 

If you don't need the original _id you can just use "$barcode" instead of "$_id" in the first line (that is, _id: "$barcode"), but since there may be multiple fields you want to preserve, the $first trick works with as many of them as you want to keep.
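
The duplicated barcode from $push comes from $unwind: each comment becomes its own document and carries a copy of the parent's fields, so $push collects one copy per comment while $first keeps a single value. A plain-JavaScript imitation of that behaviour, with a hypothetical `unwound` array standing in for the output of the $unwind stage:

```javascript
// What the documents look like after $unwind: one row per comment,
// each carrying its own copy of the parent fields such as barcode.
const unwound = [
  { _id: "51b0d94c3f72fb89c9000014", barcode: "H-131887",
    cal_date: new Date("2013-07-03T16:04:57.893Z") },
  { _id: "51b0d94c3f72fb89c9000014", barcode: "H-131887",
    cal_date: new Date("2012-07-03T19:39:43.074Z") }
];

// $push gathers the value from every row in the group, hence the repeats.
const pushed = unwound.map(row => row.barcode);

// $first takes the value from the first row in the group only.
const first = unwound[0].barcode;

console.log(pushed); // two copies of the barcode
console.log(first);  // a single barcode string
```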

Licensed under: CC-BY-SA with attribution