Since reads in MongoDB are concurrent, I completely understand what you're saying. Yes, it is possible for two workers to pick the same document, amend it, and then re-save it, overwriting each other's changes (not to mention the resources wasted on crawling duplicates).
I believe you have to accept that one way or another you will lose some performance; that is an unfortunate cost of ensuring consistency.
You could use findAndModify to pick exclusively: since findAndModify is isolated, it can guarantee that you only pick a URL that has not been picked before. The problem is that findAndModify, because of that isolation, will slow down the rate of your crawling.
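To make the idea concrete, here is a toy in-memory sketch of what the atomic "find one and mark it picked" step buys you. The field names (`url`, `status`) are made up for illustration; against a real deployment this single atomic step would be something like `collection.find_one_and_update({"status": "new"}, {"$set": {"status": "picked"}})` in pymongo, with MongoDB providing the atomicity that the lock simulates here:

```python
import threading

class UrlQueue:
    """In-memory stand-in for a MongoDB collection of URL documents."""
    def __init__(self, urls):
        self._lock = threading.Lock()  # plays the role of findAndModify's isolation
        self._docs = [{"url": u, "status": "new"} for u in urls]

    def find_and_modify(self):
        """Atomically find one 'new' doc and flip it to 'picked'."""
        with self._lock:
            for doc in self._docs:
                if doc["status"] == "new":
                    doc["status"] = "picked"
                    return doc["url"]
        return None  # nothing left to pick

queue = UrlQueue(f"http://example.com/{i}" for i in range(100))
picked = []
picked_lock = threading.Lock()

def worker():
    while True:
        url = queue.find_and_modify()
        if url is None:
            return
        with picked_lock:
            picked.append(url)  # crawl would happen here

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Because pick-and-mark is one atomic step, no URL is ever picked twice.
assert len(picked) == len(set(picked)) == 100
```

The trade-off shows up in that single lock: every worker serialises through the same pick step, which is exactly why the crawl rate drops.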
Another way would be an optimistic lock, whereby you write a lock to the picked documents very quickly after picking them. This means there will be some wastage from crawling duplicate URLs, but it also means you get the maximum performance and concurrency out of your workers.
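A sketch of the optimistic variant, again as a pure-Python simulation with invented field names (`url`, `locked_by`): workers read candidates with no coordination at all, then race to write the lock; in MongoDB the race would be decided by an update whose filter requires the lock field to still be unset, checking the modified count to see who won. Losers just skip the document and retry:

```python
import random
import threading

docs = [{"url": f"http://example.com/{i}", "locked_by": None} for i in range(100)]
docs_lock = threading.Lock()  # models MongoDB's per-document atomic update

def try_lock(doc, worker_id):
    # Roughly: update({"url": ..., "locked_by": None},
    #                 {"$set": {"locked_by": worker_id}})
    # then check that exactly one document was modified.
    with docs_lock:
        if doc["locked_by"] is None:
            doc["locked_by"] = worker_id
            return True
    return False  # someone else locked it first: wasted work

def worker(worker_id):
    while True:
        # Unsynchronized read: several workers may grab the same candidate.
        candidates = [d for d in docs if d["locked_by"] is None]
        if not candidates:
            return
        doc = random.choice(candidates)
        if not try_lock(doc, worker_id):
            continue  # lost the race; this is the "wastage"
        # ... crawl doc["url"] here ...

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every URL ends up locked by exactly one worker; only the occasional
# losing attempt is wasted, and picking itself never serialises.
assert all(d["locked_by"] is not None for d in docs)
```

The key design difference from findAndModify is that only the tiny lock write is contended; the expensive parts (finding candidates, crawling) run fully in parallel, at the cost of occasional duplicate effort.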
Which one you go for is something you will need to test for yourself to discover which best suits your workload.