Question

The problem: given documents that each hold two arrays of sub-documents, I want to find the documents where, for at least one pair of array elements, essentially:

"obj1.a" === "obj2.b"

So given the sample documents below, but actually expecting much larger arrays, how can this be done?

{
    "obj1": [
        { "a": "a", "b": "b" },
        { "a": "a", "b": "c" }
    ],
    "obj2": [
        { "a": "c", "b": "b" },
        { "a": "c", "b": "c" }
    ]
},
{
    "obj1": [
        { "a": "a", "b": "b" }
    ],
    "obj2": [
        { "a": "a", "b": "a" }
    ]
}

One approach might be to compare these with JavaScript and the $where operator, but looping large arrays from within JavaScript doesn't sound very favorable.
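As a rough illustration of what that JavaScript comparison could look like, here is a minimal sketch. The function name hasMatchingValue is made up for this example, and the sample documents are copied from above. It is run here as plain JavaScript with the document bound to `this`, which is how $where would invoke it:

```javascript
// Hypothetical predicate of the kind $where would evaluate.
// `this` is assumed to be the document being tested.
function hasMatchingValue() {
    var seen = {};
    var i;
    // Record every "b" value found in obj2.
    for (i = 0; i < this.obj2.length; i++) {
        seen[this.obj2[i].b] = true;
    }
    // Look for any "a" value in obj1 that was recorded above.
    for (i = 0; i < this.obj1.length; i++) {
        if (seen[this.obj1[i].a] === true) {
            return true;
        }
    }
    return false;
}

// Sample documents from the question.
var sampleDocs = [
    {
        "obj1": [ { "a": "a", "b": "b" }, { "a": "a", "b": "c" } ],
        "obj2": [ { "a": "c", "b": "b" }, { "a": "c", "b": "c" } ]
    },
    {
        "obj1": [ { "a": "a", "b": "b" } ],
        "obj2": [ { "a": "a", "b": "a" } ]
    }
];

var whereResults = sampleDocs.map(function (doc) {
    return hasMatchingValue.call(doc);
});
console.log(whereResults); // [ false, true ]
```

In the shell the same function could presumably be passed as db.objects.find({ "$where": hasMatchingValue }), but it would still be evaluated in JavaScript for every document, which is the cost being questioned here.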

Another approach is using the aggregation framework to do the comparison, but this involves unwinding two arrays on top of each other which can create a lot of documents to be processed in the pipeline:

db.objects.aggregate([
    { "$unwind": "$obj1" },
    { "$unwind": "$obj2" },
    { "$project": {
        "match": { "$eq": [ "$obj1.a", "$obj2.b" ] }
    }},
    { "$group": {
        "_id": "$_id",
        "match": { "$max": "$match" }
    }},
    { "$match": { "match": true } }
])

Where performance is a concern, it is easy to see how the number of documents passing through $project and $group can end up many times larger than the number of documents in the collection, since each document is expanded into one document per combination of obj1 and obj2 elements.
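To put a rough number on that growth, the following sketch counts how many pipeline documents a single input document becomes after the two $unwind stages. The helper name unwoundCount and the 1000-element document are assumptions made purely for illustration:

```javascript
// Hypothetical helper: unwinding obj1 and then obj2 yields one pipeline
// document per (obj1 element, obj2 element) pair.
function unwoundCount(doc) {
    return doc.obj1.length * doc.obj2.length;
}

// The two sample documents from the question.
var firstSample = {
    "obj1": [ { "a": "a", "b": "b" }, { "a": "a", "b": "c" } ],
    "obj2": [ { "a": "c", "b": "b" }, { "a": "c", "b": "c" } ]
};
var secondSample = {
    "obj1": [ { "a": "a", "b": "b" } ],
    "obj2": [ { "a": "a", "b": "a" } ]
};

// An assumed "large" document with 1000 elements in each array.
// Only the array lengths matter for the count.
var largeDoc = { "obj1": new Array(1000), "obj2": new Array(1000) };

var unwoundCounts = [
    unwoundCount(firstSample),
    unwoundCount(secondSample),
    unwoundCount(largeDoc)
];
console.log(unwoundCounts); // [ 4, 1, 1000000 ]
```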

So in order to do this there has to be some way of comparing the array elements without needing to perform an $unwind on those arrays and end up grouping the documents back together. How could this be done?


Solution

You can get this sort of result using the $map operator that was introduced in MongoDB 2.6. This operates by taking an input array and allowing an expression to be evaluated over each element producing a new array as the result:

db.objects.aggregate([
    { "$project": {
        "match": {
            "$size": {
                "$setIntersection": [
                    { "$map": {
                        "input": "$obj1",
                        "as": "el",
                        "in": { "$concat": ["$$el.a",""] }
                    }},
                    { "$map": {
                        "input": "$obj2",
                        "as": "el",
                        "in": { "$concat": ["$$el.b",""] }
                    }}
                ]
            }
        }
    }},
    { "$match": { "match": { "$gte": 1 } } }
])

Here it is used together with the $setIntersection and $size operators. As each $map returns just the property values from the elements that you want to compare, you end up with two arrays containing only those values.
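For readers who find it easier to follow in ordinary code, here is a plain JavaScript model of what the pipeline computes per document. It is a sketch only: mapField, setIntersection and matchSize are illustrative names and not MongoDB APIs, and the model assumes the compared values are simple strings as in the samples:

```javascript
// Models $map: pull one field out of every element of an array.
function mapField(array, field) {
    return array.map((el) => el[field]);
}

// Models $setIntersection: unique values present in both arrays.
function setIntersection(left, right) {
    const rightSet = new Set(right);
    return [...new Set(left)].filter((value) => rightSet.has(value));
}

// Models the $project stage: $size of the intersection of obj1.a and obj2.b.
function matchSize(doc) {
    return setIntersection(mapField(doc.obj1, "a"), mapField(doc.obj2, "b")).length;
}

// Sample documents from the question.
const docs = [
    {
        obj1: [ { a: "a", b: "b" }, { a: "a", b: "c" } ],
        obj2: [ { a: "c", b: "b" }, { a: "c", b: "c" } ]
    },
    {
        obj1: [ { a: "a", b: "b" } ],
        obj2: [ { a: "a", b: "a" } ]
    }
];

const sizes = docs.map(matchSize);
// Models the final $match: keep documents whose size is 1 or greater.
const matched = docs.filter((doc) => matchSize(doc) >= 1);

console.log(sizes);          // [ 0, 1 ]
console.log(matched.length); // 1
```

Only the second sample document survives, because "a" is the only value that appears both as an obj1.a and as an obj2.b.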

The only catch is that the "in" option for $map currently requires an operator to be present within the object {} notation of its arguments. You cannot presently say:

"in": "$$el.a"

To get around this we are using $concat to join the string value with an empty string. Other operators can be used for different types, or even $ifNull, which is fairly generic and gets around "type" problems:

"in": { "$ifNull": [ "$$el.a", false ] }

The $setIntersection that wraps these is used to determine which values of those "sets" are the same, and returns its result as another array containing only the matching values.
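One consequence of the intersection is worth checking if the $ifNull form is used for extraction: the placeholder value itself can end up in both sets. The sketch below models this in plain JavaScript. The names ifNull, extract and intersect are illustrative stand-ins rather than MongoDB APIs, and the document is a made-up case where neither array carries the compared field:

```javascript
// Models $ifNull: use the replacement when the value is null or missing.
function ifNull(value, replacement) {
    return (value === null || value === undefined) ? replacement : value;
}

// Models $map with "in": { "$ifNull": [ "$$el.<field>", false ] }.
function extract(array, field) {
    return array.map((el) => ifNull(el[field], false));
}

// Models $setIntersection: unique values present in both arrays.
function intersect(left, right) {
    const rightSet = new Set(right);
    return [...new Set(left)].filter((value) => rightSet.has(value));
}

// Assumed document: obj1 has no "a" field and obj2 has no "b" field.
const sparseDoc = {
    obj1: [ { b: "x" } ],
    obj2: [ { a: "y" } ]
};

const common = intersect(extract(sparseDoc.obj1, "a"), extract(sparseDoc.obj2, "b"));
console.log(common); // [ false ]
```

Because the intersection holds one element, such a document would pass a size check of 1 or greater even though no real values matched. If the fields can be absent, one possible guard is to strip the placeholder first, for example with $setDifference against [ false ], before taking the size.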

Finally, the $size operator here is an aggregation operator that returns the actual "size" of the array as an integer. This can then be used in the following $match to filter out any results that did not return a "size" value of 1 or greater.

Essentially this does all the work of the earlier multi-stage pipeline, whose first two $unwind stages multiply the number of documents to be processed, in two simple stages, and without ever increasing the number of documents beyond those received as input.

Licensed under: CC-BY-SA with attribution