Question

I am building a web crawler in Python using MongoDB to store a queue with all URLs to crawl. I will have several independent workers that will crawl URLs. Whenever a worker completes crawling a URL, it will make a request in the MongoDB collection "queue" to get a new URL to crawl.

My issue is that since there will be multiple crawlers, how can I ensure that two crawlers won't query the database at the same time and get the same URL to crawl?

Thanks a lot for your help


Solution

Since reads in MongoDB are concurrent, I completely understand what you're saying. Yes, it is possible for two workers to pick the same document, amend it, and then re-save it, overwriting each other (not to mention the resources wasted crawling the same URL twice).

I believe you must accept that, one way or another, you will lose some performance; that is an unfortunate part of ensuring consistency.

You could use findAndModify to pick exclusively: since findAndModify is atomic on a single document, it can ensure that you only pick a URL that has not already been picked. The problem is that this atomicity tends to slow down the rate of your crawling.
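A minimal sketch of the findAndModify approach in PyMongo (where it is exposed as `find_one_and_update`). The "queue" collection layout here is an assumption: documents of the form `{"url": ..., "status": "pending"}`, with the claim marking them `"in_progress"`. Field names are illustrative, not from the original question.

```python
# Sketch: atomically claim one pending URL with find_one_and_update.
# Assumes hypothetical documents like {"url": ..., "status": "pending"}.
from datetime import datetime, timezone


def claim_query():
    """Build the filter and update used to claim one pending URL."""
    filter_doc = {"status": "pending"}
    update_doc = {"$set": {"status": "in_progress",
                           "claimed_at": datetime.now(timezone.utc)}}
    return filter_doc, update_doc


def claim_next_url(queue):
    """Atomically pick one pending URL, or return None if the queue is empty.

    `queue` is a MongoDB collection (e.g. db["queue"] from PyMongo).
    find_one_and_update matches and updates in a single atomic step, so
    two workers can never claim the same document.
    """
    filter_doc, update_doc = claim_query()
    # return_document=True returns the document *after* the update
    # (the same value as pymongo.ReturnDocument.AFTER).
    doc = queue.find_one_and_update(filter_doc, update_doc,
                                    return_document=True)
    return doc["url"] if doc else None
```

Each worker just loops on `claim_next_url(db["queue"])` and crawls whatever it returns; a `None` result means the queue is drained.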

Another way would be an optimistic lock, whereby you write a lock to the picked documents very quickly after picking them. This means there will be some wasted work crawling duplicate URLs, but it does mean you will get the maximum performance and concurrency out of your workers.

Which one you go for requires testing to discover which best suits your workload.

Licensed under: CC-BY-SA with attribution