Question

I have a somewhat complex query that is very critical to my application.

$cur = $col->find(
    array (
        '$or' => array(
            array('owner' => $my_id),
            array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F'))),
            array('owner' => array('$in' => $friends), 'perm.list' => $my_id)
        )
    )
)->limit(10)->skip(0)->sort(array('ca' => -1));

The intention is to find the first 10 posts, sorted by their creation time in descending order, which are:

a) made by myself, or b) made by my friends with a permission type of 'P' for public or 'F' for friends, or c) made by my friends whose permission list specifically designates me as a viewer.

The variable $friends is an array of user ids who are friends with me. perm.type has a total of 4 values: 'P', 'F', 'S', and 'C'. perm.list is an array of user ids who have permission to view the post.

The above query works as intended in filtering out the correct results. But I ran into problems creating effective indexes for it.

The indexes I have created for this query are:

$col->ensureIndex(array('owner' => 1, 'ca' => -1));
$col->ensureIndex(array('owner' => 1, 'perm.type' => 1, 'ca' => -1));
$col->ensureIndex(array('owner' => 1, 'perm.list' => 1, 'ca' => -1));

The first index is designed for the first part of the query criteria, the 2nd index for the 2nd criterion, and the 3rd, a multikey index, for the 3rd criterion.

A typical post would look like this:

{
    "_id": "...",
    "owner": "001",
    "perm": {
        "type": "P",
        "list": []
    },
    "msg": "Nice dress!",
    "ca": 1390459269
}

Another example:

{
    "_id": "...",
    "owner": "007",
    "perm": {
        "type": "C",
        "list": ["001", "005"]
    },
    "msg": "Nice day!",
    "ca": 1390837209
}

I know of the limitation that existed prior to MongoDB version 2.6, which prevented indexes from being used when combining $or with sort(). According to http://jira.mongodb.org/browse/SERVER-1205, this should have been fixed in 2.6.

And sure enough, explain() now shows the use of my indexes, where it didn't in 2.4. But when I run the query, it is now much slower than when it didn't use any indexes at all. explain() shows that nscanned is way higher than expected. After some searching, I found https://jira.mongodb.org/browse/SERVER-3310, which seems to explain the problem I am experiencing. But as the ticket states, that issue should have been fixed in 2.5.5, so what is causing my problem here?
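
For reference, this is roughly how I read those numbers with the legacy PHP driver (a sketch; the $criteria variable is just shorthand for the inline filter shown above):

// Rebuild the same filter, then run the query and dump the plan;
// 'cursor', 'n', 'nscanned' and 'nscannedObjects' are the fields of
// interest in the 2.6 explain() output.
$criteria = array('$or' => array(
    array('owner' => $my_id),
    array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F'))),
    array('owner' => array('$in' => $friends), 'perm.list' => $my_id),
));

$explain = $col->find($criteria)
               ->sort(array('ca' => -1))
               ->limit(10)
               ->explain();

echo 'cursor:          ' . $explain['cursor'] . "\n";
echo 'n:               ' . $explain['n'] . "\n";
echo 'nscanned:        ' . $explain['nscanned'] . "\n";
echo 'nscannedObjects: ' . $explain['nscannedObjects'] . "\n";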

I have tried setting up different indexes, compounding them in different orders, and even splitting them up to check whether the new index intersection feature would help. None of it worked.

Does anyone know what my problem here is?

Edit: After more testing, observing, and thinking, I have narrowed down the issue: it is really using $in, limit() and sort() all together in one query that causes the problem. Adding a top-level '$or' then multiplies the problem by the number of '$or' clauses. I will explain my logic below.

I have refined my indexes to the following:

$col->ensureIndex(array('owner._id' => 1, 'ca' => -1, 'perm.type' => 1));
$col->ensureIndex(array('perm.list' => 1, 'ca' => -1, 'owner._id' => 1));

The reasoning behind the first index is that, when I have millions of records, the query should start from the given set of user ids (friends) to narrow down the choices, then walk the records in reverse chronological order, checking whether each has the right permission type. The problem with this index is that the query optimizer has no idea how many records it needs to scan to satisfy the limit(10) condition. It has no idea where the 10 most recent records will eventually come from, so it has to return up to 10 records for each id specified in the '$in' clause, then repeat the same thing for each '$or'. So if I have two '$or' clauses, each with an '$in' of 100 user ids, it has to scan enough records to match 10 records per user in the '$in' of the first '$or', and then 10 records per user in the '$in' of the second '$or', returning up to 2000 records (this is the n in explain(); nscanned will be much higher, depending on how many records must be scanned to find those 2000 matches). From these 2000 records, already in chronological order, it takes the top 10 to return.

So, what if I build the index in the order "'ca' => -1, 'owner._id' => 1, 'perm.type' => 1"? Well, I can't really do that, because when I have hundreds of thousands of users with millions of records, most records will be irrelevant to the viewer. If the index starts with 'ca' => -1, the scan will pass over a lot of irrelevant records before hitting one that fits the criteria, even though each hit it finds counts directly against limit(10), and it only needs to scan as many records as it takes to find 10 that match. That scan can still be tens of thousands of records, or more. Worse yet, if 10 matching records can't be found, it has to walk the entire index to find that out.

The 2nd index is meant to look at each record whose permission list designates me, go through them in reverse chronological order, and check whether they come from my friends. This is pretty straightforward, and the problem here really comes from combining it with the '$in', limit() and sort() behavior described above, all in one query.

At this point, I am looking into merging results on the application side. Breaking up the '$or' to handle on the application side is easy, but how do I break up the '$in' in the criteria array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F')))?


Solution

After 3 days of testing and research, the reason for the inefficient queries is now clear: MongoDB at the current version (2.6.1) is still unable to optimize queries that use $or, $in, limit() and sort() all at once. The fixes for https://jira.mongodb.org/browse/SERVER-1205 and https://jira.mongodb.org/browse/SERVER-3310 each only improved performance on queries having 3 of the 4 operations listed above. Introducing a 4th operation into the query throws the optimization out the window. The symptom is full index and document scans within the $or clauses, even though limit(10) is specified.

The attempt to solve this by breaking up the $or clauses individually and merging the results on the application side, while feasible, ran into major obstacles when I attempted to implement pagination.
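
Roughly, the abandoned approach looked like the sketch below (simplified; the de-duplication and ordering details here are illustrative, not a full implementation):

// Run each '$or' clause as its own indexed query, then merge by 'ca' in PHP.
$clauses = array(
    array('owner' => $my_id),
    array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F'))),
    array('owner' => array('$in' => $friends), 'perm.list' => $my_id),
);
$merged = array();
foreach ($clauses as $clause) {
    $cur = $col->find($clause)->sort(array('ca' => -1))->limit(10);
    foreach ($cur as $doc) {
        $merged[(string) $doc['_id']] = $doc;  // de-duplicate overlapping clauses
    }
}
usort($merged, function ($a, $b) { return $b['ca'] - $a['ca']; });
$page = array_slice($merged, 0, 10);

// Page 1 is easy; page 2 is where it falls apart, because each clause
// would need its own skip/range bookkeeping.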

My current solution, then, is to come up with an equivalent query that uses only 3 of the 4 operations. I decided to 'flatten' the '$in' operator, turning each element of the $friends array into another '$or' condition with an exact owner value to query for. So instead of the 3 '$or' conditions in my original query, I now have as many '$or' conditions as there are elements in my $friends array, plus the 2 other original '$or' conditions.
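
In code, the rewrite looks roughly like this (a sketch of the idea; the remaining small '$in's could be flattened the same way if needed):

// Flatten the owner '$in' into one exact-owner clause per friend.
$orClauses = array(
    array('owner' => $my_id),  // first original clause, unchanged
);
foreach ($friends as $friendId) {
    $orClauses[] = array(
        'owner'     => $friendId,  // exact value instead of '$in'
        'perm.type' => array('$in' => array('P', 'F'))
    );
}
// third original clause, unchanged
$orClauses[] = array('owner' => array('$in' => $friends), 'perm.list' => $my_id);

$cur = $col->find(array('$or' => $orClauses))
           ->limit(10)->skip(0)->sort(array('ca' => -1));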

The query is now optimized. When run with explain(), nscannedObjects and nscanned are way down, at the values they are supposed to be. Consider the documentation on '$or', which states:

When using indexes with $or queries, each clause of an $or will execute in parallel. These clauses can each use their own index.

This may actually be an acceptable solution performance-wise. I hope this helps anyone who runs into the same problems I did.

Other tips

I'm not sure if this is a bug in MongoDB 2.6, but you can take a look at this article about index creation.

The order of fields in an index should be:

1. First, fields on which you will query for exact values.
2. Second, fields on which you will sort.
3. Finally, fields on which you will query for a range of values.

So, following that advice, you can try these indexes:

$col->ensureIndex(array('owner' => 1, 'ca' => -1));
$col->ensureIndex(array('ca' => -1, 'owner' => 1, 'perm.type' => 1));
$col->ensureIndex(array('perm.list' => 1, 'ca' => -1, 'owner' => 1));

Edit:

From your explain output: if you're testing on small data sets, a full collection scan is fast because MongoDB doesn't need to go through many documents. You should run your test with, say, 10,000 documents to see a real difference. The values of the indexed fields should also be varied enough to give the indexes good selectivity for your queries (e.g. not all documents coming from the same owner).

TL;DR: I believe you are using the wrong algorithm/data structure for the tool, or vice versa. I'd suggest using a fan-out approach, as discussed in this SO question or my blog post. Sorry for shamelessly advertising my previous posts, but it doesn't make sense to repeat that information here.


The philosophy of MongoDB, contrary to the typical SQL philosophy, is to be rather write-heavy. You are essentially trying to implement a ranking algorithm in a MongoDB query, but MongoDB's query philosophy is "query by example". That's not a good fit.

Sure, the aggregation pipeline doesn't quite fit that philosophy anymore, and things may well change; there are optimizations on the way that allow for more complex queries, such as index intersection.

Still, what you are doing here is very hard to control. You not only want MongoDB to use index intersection (new in 2.6, and currently limited to two indexes), but you're also combining it with $in queries and compound indexes. That's a lot to ask, and if the number of friends in the $in grows too large, you're out of luck anyhow. The same is true if a piece of news is shared with too many people; in the worst case, a document grows past 16 MB. Growing documents are expensive, complex queries are expensive, and large documents are expensive, too.

I suggest you use a fan-out approach for newsfeeds where you can implement a very complex ranking algorithm in code, rather than in MongoDB.
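
A minimal sketch of the fan-out-on-write idea (the 'timelines' collection and the helper name are hypothetical, not taken from the linked posts):

// At write time, push a reference into each eligible viewer's timeline,
// so reading a feed becomes one simple indexed query per viewer.
function publishPost($db, array $post, array $audience) {
    $db->posts->insert($post);  // the legacy driver adds '_id' to $post
    foreach ($audience as $viewerId) {  // e.g. the friends allowed to see it
        $db->timelines->insert(array(
            'viewer' => $viewerId,
            'post'   => $post['_id'],  // or embed a copy of the post
            'ca'     => $post['ca'],
        ));
    }
}

// Reading the feed then needs neither '$or' nor '$in':
// $db->timelines->ensureIndex(array('viewer' => 1, 'ca' => -1));
// $feed = $db->timelines->find(array('viewer' => $my_id))
//                       ->sort(array('ca' => -1))->limit(10);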

I'm not saying it's impossible to optimize your query, but since explain's output is so ginormous and there are so many interacting effects here (typical array sizes, typical match ratio, selectivity of the indexes, etc.), it will be very hard to find a good solution to this problem, even for someone who has full access to the data (i.e. you).

Even if you get this to work, you might run into critical problems if your access patterns change, the data changes, and so on, so you'll be dealing with a fragile construct.
