Question

I am creating a simple web spider. All it does is accept a URL, download the HTML, and extract the URLs it contains. It then repeats the process for each new URL. I'm also making sure I don't visit the same URL twice, and I am limiting the number of concurrent downloads.

After every unique URL has been exhausted (this could take days, weeks, or until after I'm dead and gone), I would like to perform an action, like updating the UI or simply exiting the application.

The problem is, I don't know how to detect when the last thread has finished running.

Has this threading problem been solved? Am I looking at the problem wrong?

One thought was to keep each thread alive until all of its children finished (join). The problem is that the number of threads grows exponentially. For such a long-running process, it would quickly exhaust OS resources.

Solution

I'm not sure what language we are talking about, so I'll speak generically.

You need a data structure for each URL that keeps track of how many "child" pages are generated from it. Whenever a URL is being spidered, it has a "parent" data structure. Whenever a new page is found, it is added to the parent's tree count. Whenever a page is spidered, the parent's tree count is decremented. This will need to be done in a synchronized manner since multiple threads will be updating it.
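
A minimal sketch of such a node, in Java purely for concreteness (the class and field names are made up; `pending` counts the page itself plus its outstanding descendants, so it only reaches zero once the whole subtree has been spidered):

```java
import java.util.concurrent.atomic.AtomicInteger;

// One node per URL in the crawl tree.
class UrlNode {
    final String url;
    final UrlNode parent;  // null for the root URL
    // This page plus its not-yet-finished descendants. Starts at 1
    // because the page itself still has to be spidered.
    final AtomicInteger pending = new AtomicInteger(1);

    UrlNode(String url, UrlNode parent) {
        this.url = url;
        this.parent = parent;
        // A newly discovered page is outstanding work for every ancestor.
        for (UrlNode n = parent; n != null; n = n.parent) {
            n.pending.incrementAndGet();
        }
    }

    // Call exactly once per page, after it has been downloaded and its
    // links have been turned into child nodes (that ordering keeps the
    // counts from hitting zero prematurely).
    void complete() {
        for (UrlNode n = this; n != null; n = n.parent) {
            if (n.pending.decrementAndGet() == 0 && n.parent == null) {
                onCrawlFinished();  // the root hit zero: nothing is left
            }
        }
    }

    void onCrawlFinished() {
        System.out.println("every reachable URL has been spidered");
    }
}
```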

You may actually want to save the entire URL structure. The root URL "http://foo.x/" has links to "/1.html" and "/2.html", so its children count is 2. The root URL has a null parent, and "1" and "2" have the root as their parent. When "1.html" is spidered, the root's children count is decremented to 1. But if there are 3 links inside of "1.html", then the root's count gets incremented to 4. If you want to keep track of the tree, then the children count of "1.html" goes to 3, etc. Then when one of the children of "1.html" gets spidered, the count for "1.html" goes to 2 and the root URL's count goes to 3.
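
Replaying that walkthrough with the `UrlNode` sketch above (the child URLs `a.html`, `b.html`, and `c.html` are made up for the example; `pending` includes the page itself until its own `complete()` call, after which the numbers line up with the counts in the text):

```java
public class Trace {
    public static void main(String[] args) {
        UrlNode root = new UrlNode("http://foo.x/", null);       // root: 1
        UrlNode one  = new UrlNode("http://foo.x/1.html", root); // root: 2
        UrlNode two  = new UrlNode("http://foo.x/2.html", root); // root: 3
        root.complete();               // root page itself done -> root: 2

        // Spidering 1.html turns up three links:
        UrlNode a = new UrlNode("http://foo.x/a.html", one); // one: 2, root: 3
        UrlNode b = new UrlNode("http://foo.x/b.html", one); // one: 3, root: 4
        UrlNode c = new UrlNode("http://foo.x/c.html", one); // one: 4, root: 5
        one.complete();        // 1.html itself done -> one: 3, root: 4
        a.complete();          // one child done     -> one: 2, root: 3
    }
}
```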

You certainly do not want to keep the threads around and join them later as you mention -- your thread count will explode. You should use a thread pool and submit URLs to be spidered, each with its associated node in the URL tree, so the same fixed set of threads can spider them all.
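
A sketch of that, again in Java and with a hypothetical `downloadAndExtractLinks` helper standing in for the actual fetching and parsing:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class Spider {
    // A bounded pool instead of a thread per URL keeps concurrency fixed.
    private final ExecutorService pool = Executors.newFixedThreadPool(8);
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    void crawl(String rootUrl) {
        visited.add(rootUrl);
        submit(new UrlNode(rootUrl, null));
    }

    private void submit(UrlNode node) {
        pool.submit(() -> {
            for (String link : downloadAndExtractLinks(node.url)) {
                if (visited.add(link)) {             // true only if unseen
                    submit(new UrlNode(link, node)); // bumps ancestor counts
                }
            }
            node.complete(); // last, so counts never hit zero prematurely
        });
    }

    // Hypothetical: fetch the page and return the URLs it links to.
    private List<String> downloadAndExtractLinks(String url) {
        throw new UnsupportedOperationException("left to the real spider");
    }
}
```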

When a URL is spidered and its children count goes to 0, you know that you have spidered its whole subtree, and the URL can be removed from the working list and moved to the done list. Again, these lists will need to be synchronized since multiple threads will be operating on them.
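
For instance (concurrent sets here stand in for explicitly synchronized lists, and the names are again just for illustration):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class CrawlLists {
    private final Set<String> working = ConcurrentHashMap.newKeySet();
    private final Set<String> done = ConcurrentHashMap.newKeySet();

    void started(String url) {
        working.add(url);
    }

    // Call when a node's pending count reaches zero: its subtree is done.
    void retire(UrlNode node, Runnable onAllDone) {
        working.remove(node.url);
        done.add(node.url);
        if (node.parent == null) {
            onAllDone.run(); // the root retired: update the UI, exit, etc.
        }
    }
}
```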

Hope this helps somewhat.

Licensed under: CC-BY-SA with attribution