Question

I have a somewhat complex query that is very critical to my application.

$cur = $col->find(
    array (
        '$or' => array(
            array('owner' => $my_id),
            array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F'))),
            array('owner' => array('$in' => $friends), 'perm.list' => $my_id)
        )
    )
)->limit(10)->skip(0)->sort(array('ca' => -1));

The intention is to find the first 10 posts, sorted by their creation time in descending order, which are:

a) made by myself, or b) made by my friends with a permission type of 'P' for public or 'F' for friends, or c) made by my friends whose permission list specifically designates me as a viewer.

The variable $friends is an array of user ids who are friends with me. perm.type has a total of 4 values: 'P', 'F', 'S', and 'C'. perm.list is an array of user ids who have permission to view the post.

The above query works as intended in filtering out the correct results. But I ran into problems creating effective indexes for it.

The indexes I have created for this query are:

$col->ensureIndex(array('owner' => 1, 'ca' => -1));
$col->ensureIndex(array('owner' => 1, 'perm.type' => 1, 'ca' => -1));
$col->ensureIndex(array('owner' => 1, 'perm.list' => 1, 'ca' => -1));

The first index is designed for the first part of the query criteria, the 2nd index for the 2nd criterion, and the 3rd, a multikey index, for the 3rd criterion.

A typical post would look like this:

{
    "_id": "...",
    "owner": "001",
    "perm": {
        "type": "P",
        "list": []
    },
    "msg": "Nice dress!",
    "ca": 1390459269
}

Another example:

{
    "_id": "...",
    "owner": "007",
    "perm": {
        "type": "C",
        "list": ["001", "005"]
    },
    "msg": "Nice day!",
    "ca": 1390837209
}

I know of the limitation that existed prior to MongoDB version 2.6, which prevented indexes from being used when combining $or with sort(). According to http://jira.mongodb.org/browse/SERVER-1205, this should have been fixed in 2.6.

And sure enough, explain() now shows the use of my indexes, where it didn't in 2.4. But when I run the query, it is now much slower than when it didn't use any indexes at all. explain() shows that nscanned is way higher than expected. After some searching, I found https://jira.mongodb.org/browse/SERVER-3310, which seems to explain the problem I am experiencing. But as the ticket states, that issue should have been fixed in 2.5.5, so what is causing my problem here?
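
For reference, this is roughly how I read those numbers with the legacy PHP driver (a sketch; the $criteria variable is just shorthand for the inline filter shown above):

// Rebuild the same filter, then run the query and dump the plan;
// 'cursor', 'n', 'nscanned' and 'nscannedObjects' are the fields of
// interest in the 2.6 explain() output.
$criteria = array('$or' => array(
    array('owner' => $my_id),
    array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F'))),
    array('owner' => array('$in' => $friends), 'perm.list' => $my_id),
));

$explain = $col->find($criteria)
               ->sort(array('ca' => -1))
               ->limit(10)
               ->explain();

echo 'cursor:          ' . $explain['cursor'] . "\n";
echo 'n:               ' . $explain['n'] . "\n";
echo 'nscanned:        ' . $explain['nscanned'] . "\n";
echo 'nscannedObjects: ' . $explain['nscannedObjects'] . "\n";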

I have tried setting up different indexes, compounding them in different orders, and even splitting them up to check whether the new index intersection feature would help. None of it worked.

Does anyone know what my problem here is?

Edit: After more testing, observing, and thinking, I have narrowed down the issue: it is really using $in, limit() and sort() all together in one query that causes the problem. Adding a top-level '$or' then multiplies the problem by the number of '$or' clauses. I will explain my logic below.

I have refined my indexes to the following:

$col->ensureIndex(array('owner._id' => 1, 'ca' => -1, 'perm.type' => 1));
$col->ensureIndex(array('perm.list' => 1, 'ca' => -1, 'owner._id' => 1));

The reasoning behind the first index is that, when I have millions of records, the query should start from the given set of user ids (friends) to narrow down the choices, then walk the records in reverse chronological order, checking whether each has the right permission type. The problem with this index is that the query optimizer has no idea how many records it needs to scan to satisfy the limit(10) condition. It has no idea where the 10 most recent records will eventually come from, so it has to return up to 10 records for each id specified in the '$in' clause, then repeat the same thing for each '$or'. So if I have two '$or' clauses, each with an '$in' of 100 user ids, it has to scan enough records to match 10 records per user in the '$in' of the first '$or', and then 10 records per user in the '$in' of the second '$or', returning up to 2000 records (this is the n in explain(); nscanned will be much higher, depending on how many records must be scanned to find those 2000 matches). From these 2000 records, already in chronological order, it takes the top 10 to return.

So, what if I build the index in the order "'ca' => -1, 'owner._id' => 1, 'perm.type' => 1"? Well, I can't really do that, because when I have hundreds of thousands of users with millions of records, most records will be irrelevant to the viewer. If the index starts with 'ca' => -1, the scan will pass over a lot of irrelevant records before hitting one that fits the criteria, even though each hit it finds counts directly against limit(10), and it only needs to scan as many records as it takes to find 10 that match. That scan can still be tens of thousands of records, or more. Worse yet, if 10 matching records can't be found, it has to walk the entire index to find that out.

The 2nd index is meant to look at each record whose permission list designates me, go through them in reverse chronological order, and check whether they come from my friends. This is pretty straightforward, and the problem here really comes from combining it with the '$in', limit() and sort() behavior described above, all in one query.

At this point, I am looking into merging results on the application side. Breaking up the '$or' to handle on the application side is easy, but how do I break up the '$in' in the criteria array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F')))?


Solution

After 3 days of testing and research, the reason for the inefficient queries is now clear: MongoDB at the current version (2.6.1) is still unable to optimize queries that use $or, $in, limit() and sort() all at once. The fixes for https://jira.mongodb.org/browse/SERVER-1205 and https://jira.mongodb.org/browse/SERVER-3310 each only improved performance on queries having 3 of the 4 operations listed above. Introducing a 4th operation into the query throws the optimization out the window. The symptom is full index and document scans within the $or clauses, even though limit(10) is specified.

The attempt to solve this by breaking up the $or clauses individually and merging the results on the application side, while feasible, ran into major obstacles when I attempted to implement pagination.
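
Roughly, the abandoned approach looked like the sketch below (simplified; the de-duplication and ordering details here are illustrative, not a full implementation):

// Run each '$or' clause as its own indexed query, then merge by 'ca' in PHP.
$clauses = array(
    array('owner' => $my_id),
    array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F'))),
    array('owner' => array('$in' => $friends), 'perm.list' => $my_id),
);
$merged = array();
foreach ($clauses as $clause) {
    $cur = $col->find($clause)->sort(array('ca' => -1))->limit(10);
    foreach ($cur as $doc) {
        $merged[(string) $doc['_id']] = $doc;  // de-duplicate overlapping clauses
    }
}
usort($merged, function ($a, $b) { return $b['ca'] - $a['ca']; });
$page = array_slice($merged, 0, 10);

// Page 1 is easy; page 2 is where it falls apart, because each clause
// would need its own skip/range bookkeeping.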

My current solution, then, is to come up with an equivalent query that uses only 3 of the 4 operations. I decided to 'flatten' the '$in' operator, turning each element of the $friends array into another '$or' condition with an exact owner value to query for. So instead of the 3 '$or' conditions in my original query, I now have as many '$or' conditions as there are elements in my $friends array, plus the 2 other original '$or' conditions.
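
In code, the rewrite looks roughly like this (a sketch of the idea; the remaining small '$in's could be flattened the same way if needed):

// Flatten the owner '$in' into one exact-owner clause per friend.
$orClauses = array(
    array('owner' => $my_id),  // first original clause, unchanged
);
foreach ($friends as $friendId) {
    $orClauses[] = array(
        'owner'     => $friendId,  // exact value instead of '$in'
        'perm.type' => array('$in' => array('P', 'F'))
    );
}
// third original clause, unchanged
$orClauses[] = array('owner' => array('$in' => $friends), 'perm.list' => $my_id);

$cur = $col->find(array('$or' => $orClauses))
           ->limit(10)->skip(0)->sort(array('ca' => -1));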

The query is now optimized. When run with explain(), nscannedObjects and nscanned are way down, at the values they are supposed to be. Consider the documentation on '$or', which states:

When using indexes with $or queries, each clause of an $or will execute in parallel. These clauses can each use their own index.

This may actually be an acceptable solution performance-wise. I hope this helps anyone who runs into the same problems I did.

Other tips

I'm not sure if this is a bug in MongoDB 2.6, but you can take a look at this article about index creation.

The order of fields in an index should be:

1. First, fields on which you will query for exact values.
2. Second, fields on which you will sort.
3. Finally, fields on which you will query for a range of values.

So, following that advice, you can try these indexes:

$col->ensureIndex(array('owner' => 1, 'ca' => -1));
$col->ensureIndex(array('ca' => -1, 'owner' => 1, 'perm.type' => 1));
$col->ensureIndex(array('perm.list' => 1, 'ca' => -1, 'owner' => 1));

Edit:

From your explain output: if you're testing on small data sets, a full collection scan is fast because MongoDB doesn't need to go through many documents. You should run your test with, say, 10,000 documents to see a real difference. The values of the indexed fields should also be varied enough to give the indexes good selectivity for your queries (e.g. not all documents coming from the same owner).

TL;DR: I believe you are using the wrong algorithm/data structure for the tool, or vice versa. I'd suggest using a fan-out approach, as discussed in this SO question or my blog post. Sorry for shamelessly advertising my previous posts, but it doesn't make sense to repeat that information here.


The philosophy of MongoDB, contrary to the typical SQL philosophy, is to be rather write-heavy. You are essentially trying to implement a ranking algorithm in a MongoDB query, but MongoDB's query philosophy is "query by example". That's not a good fit.

Sure, the aggregation pipeline doesn't quite fit that philosophy anymore, and things may well change; there are optimizations on the way that allow for more complex queries, such as index intersection.

Still, what you are doing here is very hard to control. You not only want MongoDB to use index intersection (new in 2.6, and currently limited to two indexes), but you're also combining it with $in queries and compound indexes. That's a lot to ask, and if the number of friends in the $in grows too large, you're out of luck anyhow. The same is true if a piece of news is shared with too many people; in the worst case, a document grows past 16 MB. Growing documents are expensive, complex queries are expensive, and large documents are expensive, too.

I suggest you use a fan-out approach for newsfeeds where you can implement a very complex ranking algorithm in code, rather than in MongoDB.
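
A minimal sketch of the fan-out-on-write idea (the 'timelines' collection and the helper name are hypothetical, not taken from the linked posts):

// At write time, push a reference into each eligible viewer's timeline,
// so reading a feed becomes one simple indexed query per viewer.
function publishPost($db, array $post, array $audience) {
    $db->posts->insert($post);  // the legacy driver adds '_id' to $post
    foreach ($audience as $viewerId) {  // e.g. the friends allowed to see it
        $db->timelines->insert(array(
            'viewer' => $viewerId,
            'post'   => $post['_id'],  // or embed a copy of the post
            'ca'     => $post['ca'],
        ));
    }
}

// Reading the feed then needs neither '$or' nor '$in':
// $db->timelines->ensureIndex(array('viewer' => 1, 'ca' => -1));
// $feed = $db->timelines->find(array('viewer' => $my_id))
//                       ->sort(array('ca' => -1))->limit(10);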

I'm not saying it's impossible to optimize your query, but since explain's output is so ginormous and there are so many interacting effects here (typical array sizes, typical match ratio, selectivity of the indexes, etc.), it will be very hard to find a good solution to this problem, even for someone who has full access to the data (i.e. you).

Even if you get this to work, you might run into critical problems if your access patterns change, the data changes, and so on, so you'll be dealing with a fragile construct.
