Question

I have created 4 indexes to test query performance in my collection when quering for two fields of the same document, one of which is an array (needs a multi-key index). Two of the indexes are single and two compound.

I am surpised because of getting better performance with one of the single indexes than with the compound ones. I was expecting to obtain the best performace with a compound index, because I understand that it indexes the two fields allowing for faster querying.

These are my indexes:

{    "v" : 1, 
     "key" : { "_id" : 1 }, 
     "ns" : "bt_twitter.mallorca.mallorca", 
     "name" : "_id_"  
}, 
{    "v" : 1, 
     "key" : { "epoch_creation_date" :1 }, 
     "ns" : "bt_twitter.mallorca.mallorca", 
     "name" : "epoch_creation_date_1"  
}, 
{     "v" : 1, 
      "key" : { "related_hashtags" : 1 }, 
      "ns" : "bt_twitter.mallorca.mallorca", 
      "name" : "related_hashtags_1"  
},  
{     "v" : 1, 
      "key" : { "epoch_creation_date" : 1, "related_hashtags" : 1 }, 
      "ns" : "bt_twitter.mallorca.mallorca", 
      "name" : "epoch_creation_date_1_related_hashtags_1"  
}

My queries and performance indicators are (hint parameter shows the index used at each query):

QUERY 1:

active_collection.find(
    {'epoch_creation_date': {'$exists': True}}, 
    {"_id": 0, "related_hashtags":1}
).hint([("epoch_creation_date", ASCENDING)]).explain()

millis: 237

nscanned: 101226

QUERY 2:

active_collection.find(
    {'epoch_creation_date': {'$exists': True}}, 
    {"_id": 0, "related_hashtags": 1}
).hint([("related_hashtags", ASCENDING)]).explain()

millis: 1131

nscanned: 306715

QUERY 3:

active_collection.find(
     {'epoch_creation_date': {'$exists': True}},
     {"_id": 0, "related_hashtags": 1}
).hint([("epoch_creation_date", ASCENDING), ("related_hashtags", ASCENDING)]).explain()

millis: 935

nscanned: 306715

QUERY 4:

active_collection.find(
     {'epoch_creation_date': {'$exists': True}}, 
     {"_id": 0, "related_hashtags": 1}
).hint([("related_hashtags", ASCENDING),("epoch_creation_date", ASCENDING)]).explain()

millis: 1165

nscanned: 306715

QUERY 1 scans less documents, what is probably the reason to be faster. Can somebody help me to understand why is it performing better than queries with compound indexes? Therefore, when is better to use a compound index than a single one?

I am reading mongo documentation but these concepts are resulting hard for me to digest.

Thanks in advance.

UPDATED question (in response to Sammaye and Philipp)

This is the result of a full explain()

"cursor" : "BtreeCursor epoch_creation_date_1",
"isMultiKey" : false,
"n" : 101226,
"nscannedObjects" : 101226,
"nscanned" : 101226,
"nscannedObjectsAllPlans" : 101226,
"nscannedAllPlans" : 101226,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 242,
"indexBounds" : {u'epoch_creation_date': [[{u'$minElement': 1}, {u'$maxElement': 1}]]

},
"server" : "vmmongodb:27017"

for the following query:

active_collection.find(
{'epoch_creation_date': {'$exists': True}}, 
{"_id": 0, "related_hashtags":1})
.hint([("epoch_creation_date", ASCENDING)]).explain()
Was it helpful?

Solution

You created a compound index (named epoch_creation_date_1_related_hashtags_1), but you aren't using it in those hints. Instead of that you are using the two single-field indexes you also created (related_hashtags_1 and epoch_creation_date_1) in different order.

Of those two indexes, only epoch_creation_date_1 is effective, because you aren't querying for both fields. You are only querying for one, and this is 'epoch_creation_date': {'$exists': True}. The field-filtering which you perform with {"_id": 0, "related_hashtags":1} is done on the documents which were found by that query. At that point, indexes are of no use anymore. That means any index on related_hashtags won't be able to increase performance on this query. The compound index (when you would actually use it) might be better than no index at all, but not as good as the index on epoch_creation_date only.

OTHER TIPS

Ok after reading the question more I understand the problem. The multikey index will write an index entry PER multivalue. This means if you have 3 values per related_hashtags per document your index is actually 3x the size and has 3x the number of values to scan (if my math adds up there...).

nscanned is a counter for how times a document had to be looked at (note counter, not a specific number of unique documents looked at), this means that due to the multikey index you had to scan roughly 3x the amount of (same) documents you normally would for the first query.

This is a known caveat with multikey indexes and why you should be careful about just throwing them around like this.

I believe the reason why the third query is so slow is because multikey indexes cannot support indexOnly cursors so MongoDB could not use covered queries there.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top