retrieve large number of records with mongoDB in a reasonable time

https://stackoverflow.com/questions/8154608

02-03-2021
|

Question

I'm using mongoDB to store a querylog and get some stats about it. Objects that I store in mongoDB contains the text of the query, the date, the user, if the user clicked on some results etc etc.

Now i'm trying to retrieve all the queries not clicked by a user in a certain day with java. My code is approximately this:

    DBObject query = new BasicDBObject();
    BasicDBObject keys = new BasicDBObject();
    keys.put("Query", 1);
    query.put("Date", new BasicDBObject("$gte", beginning.getTime()).append("$lte", end.getTime()));
    query.put("IsClick", false);
    ...
    DBCursor cur = mongoCollection.find(query, keys).batchSize(5000);

The output of the query contains about 20k records that I need to iterate. The problem is that it takes minutes :( . I don't think is normal. From the server log i see:

Wed Nov 16 16:28:40 query db.QueryLogRecordImpl ntoreturn:5000 reslen:252403 nscanned:59260 { Date: { $gte: 1283292000000, $lte: 1283378399999 }, IsClick: false }  nreturned:5000 2055ms
Wed Nov 16 16:28:40 getmore db.QueryLogRecordImpl cid:4312057226672898459 ntoreturn:5000 query: { Date: { $gte: 1283292000000, $lte: 1283378399999 }, IsClick: false }  bytes:232421 nreturned:5000 170ms
Wed Nov 16 16:30:27 getmore db.QueryLogRecordImpl cid:4312057226672898459 ntoreturn:5000 query: { Date: { $gte: 1283292000000, $lte: 1283378399999 }, IsClick: false }  bytes:128015 nreturned:2661 --> 106059ms

So retrieving the first chunk takes 2 seconds, the second 0.1 seconds, the third 106 seconds!!! weird.. I tried changing the batch size, creating indexes on Date and IsClick, rebooting the machine :P but no way. Where I'm wrong?

Solution

There are several factors here that can affect speed. It will be necessary to gather some extra data to identify the cause here.

Some potential issues:

Indexes: are you using the right indexes? You should probably be indexing on IsClick/Date. That puts the range second which is the normal suggestion. Note that this is different from indexing on Date/IsClick, order is important. Try a .explain() on your query to see what indexes are being used.
Data Size: in some cases, slowness can be caused by too much data. This could be too many documents or too many large documents. It can also be caused by trying to find too many needles in a really large haystack. You are bringing back 252k in data (reslen) and 12k documents, so this is probably not the problem.
Disk IO: MongoDB uses memory-mapped files and therefore uses lots of virtual memory. If you have more data than RAM then fetching certain documents requires "going to disk". Going to disk can be a very expensive operation. You can identify "going to disk" by using tools like iostat or resmon (Windows) to monitor the disk activity.

Based on personal experience, I strongly suspect #3, with a possible exacerbation from #1. I would start with watching the IO while running a .explain() query. This should quickly narrow down the range of possible problems.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow