MongoDB Nested Array Intersection Query

Question

There are a couple of ways to do this using the aggregation framework

Just a simple set of data for example:

{
    "_id" : ObjectId("538181738d6bd23253654690"),
    "movies": [
        { "_id": 1, "rating": 5 },
        { "_id": 2, "rating": 6 },
        { "_id": 3, "rating": 7 }
    ]
},
{
    "_id" : ObjectId("538181738d6bd23253654691"),
    "movies": [
        { "_id": 1, "rating": 5 },
        { "_id": 4, "rating": 6 },
        { "_id": 2, "rating": 7 }
    ]
},
{
    "_id" : ObjectId("538181738d6bd23253654692"),
    "movies": [
        { "_id": 2, "rating": 5 },
        { "_id": 5, "rating": 6 },
        { "_id": 6, "rating": 7 }
    ]
}

Using the first "user" as an example, now you want to find if any of the other two users have at least two of the same movies.

For MongoDB 2.6 and upwards you can simply use the $setIntersection operator along with the $size operator:

db.users.aggregate([

    // Match the possible documents to reduce the working set
    { "$match": {
        "_id": { "$ne": ObjectId("538181738d6bd23253654690") },
        "movies._id": { "$in": [ 1, 2, 3 ] },
        "$and": [
            { "movies": { "$not": { "$size": 1 } } }
        ]
    }},

    // Project a copy of the document if you want to keep more than `_id`
    { "$project": {
        "_id": {
            "_id": "$_id",
            "movies": "$movies"
        },
        "movies": 1,
    }},

    // Unwind the array
    { "$unwind": "$movies" },

    // Build the array back with just `_id` values
    { "$group": {
        "_id": "$_id",
        "movies": { "$push": "$movies._id" }
    }},

    // Find the "set intersection" of the two arrays
    { "$project": {
        "movies": {
            "$size": {
                "$setIntersection": [
                   [ 1, 2, 3 ],
                   "$movies"
                ]
            }
        }
    }},

    // Filter the results to those that actually match
    { "$match": { "movies": { "$gte": 2 } } }

])

This is still possible in earlier versions of MongoDB that do not have those operators, just using a few more steps:

db.users.aggregate([

    // Match the possible documents to reduce the working set
    { "$match": {
        "_id": { "$ne": ObjectId("538181738d6bd23253654690") },
        "movies._id": { "$in": [ 1, 2, 3 ] },
        "$and": [
            { "movies": { "$not": { "$size": 1 } } }
        ]
    }},

    // Project a copy of the document along with the "set" to match
    { "$project": {
        "_id": {
            "_id": "$_id",
            "movies": "$movies"
        },
        "movies": 1,
        "set": { "$cond": [ 1, [ 1, 2, 3 ], 0 ] }
    }},

    // Unwind both those arrays
    { "$unwind": "$movies" },
    { "$unwind": "$set" },

    // Group back the count where both `_id` values are equal
    { "$group": {
        "_id": "$_id",
        "movies": {
           "$sum": {
               "$cond":[
                   { "$eq": [ "$movies._id", "$set" ] },
                   1,
                   0
               ]
           }
        } 
    }},

    // Filter the results to those that actually match
    { "$match": { "movies": { "$gte": 2 } } }
])

In Detail

That may be a bit to take in, so we can take a look at each stage and break those down to see what they are doing.

$match : You do not want to operate on every document in the collection so this is an opportunity to remove the items that are not possibly matches even if there still is more work to do to find the exact ones. So the obvious things are to exclude the same "user" and then only match the documents that have at least one of the same movies as was found for that "user".

The next thing that makes sense is to consider that when you want to match n entries then only documents that have a "movies" array that is larger than n-1 can possibly actually contain matches. The use of $and here looks funny and is not required specifically, but if the required matches were 4 then that actual part of the statement would look like this:

        "$and": [
            { "movies": { "$not": { "$size": 1 } } },
            { "movies": { "$not": { "$size": 2 } } },
            { "movies": { "$not": { "$size": 3 } } }
        ]

So you basically "rule out" arrays that are not possibly long enough to have n matches. Noting here that this $size operator in the query form is different to $size for the aggregation framework. There is no way for example to use this with an inequality operator such as $gt is it's purpose is to specifically match the requested "size". Hence this query form to specify all of the possible sizes that are less than.

$project : There are a few purposes in this statement, of which some differ depending on the MongoDB version you have. Firstly, and optionally, a document copy is being kept under the _id value so that these fields are not modified by the rest of the steps. The other part here is keeping the "movies" array at the top of the document as a copy for the next stage.

What is also happening in the version presented for pre 2.6 versions is there is an additional array representing the _id values for the "movies" to match. The usage of the $cond operator here is just a way of creating a "literal" representation of the array. Funny enough, MongoDB 2.6 introduces an operator known as $literal to do exactly this without the funny way we are using $cond right here.

$unwind : To do anything further the movies array needs to be unwound as in either case it is the only way to isolate the existing _id values for the entries that need to be matched against the "set". So for the pre 2.6 version you need to "unwind" both of the arrays that are present.

$group : For MongoDB 2.6 and greater you are just grouping back to an array that only contains the _id values of the movies with the "ratings" removed.

Pre 2.6 since all values are presented "side by side" ( and with lots of duplication ) you are doing a comparison of the two values to see if they are the same. Where that is true, this tells the $cond operator statement to return a value of 1 or 0 where the condition is false. This is directly passed back through $sum to total up the number of matching elements in the array to the required "set".

$project: Where this is the different part for MongoDB 2.6 and greater is that since you have pushed back an array of the "movies" _id values you are then using $setIntersection to directly compare those arrays. As the result of this is an array containing the elements that are the same, this is then wrapped in a $size operator in order to determine how many elements were returned in that matching set.

$match: Is the final stage that has been implemented here which does the clear step of matching only those documents whose count of intersecting elements was greater than or equal to the required number.

Final

That is basically how you do it. Prior to 2.6 is a bit clunkier and will require a bit more memory due to the expansion that is done by duplicating each array member that is found by all of the possible values of the set, but it still is a valid way to do this.

All you need to do is apply this with the greater n matching values to meet your conditions, and of course make sure your original user match has the required n possibilities. Otherwise just generate this on n-1 from the length of the "user's" array of "movies".