1) what kind of pages will you hit with a DFS versus BFS?
In most situations I would use BFS to implement a spider, because the most valuable information on a site usually sits at a shallow link depth; if it doesn't, the site is probably badly designed and not worth crawling deeply anyway.
If I want specific data from one page plus related data a few hops away, and I want to see results soon after the spider starts, then I may choose DFS. Say I want to collect all the tags from Stack Overflow (the tag page is here), and for each tag, who answered which questions. With DFS I get complete tag-questions-answers records soon after the spider starts, which makes it easy to check that it is running properly.
In short, it depends on the usage scenario.
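The BFS/DFS distinction comes down to which end of the frontier the spider pulls the next page from. A minimal sketch, using a hypothetical in-memory link graph in place of real HTTP fetches:

```python
from collections import deque

# Toy link graph standing in for pages and their outgoing links
# (hypothetical data, not a real site layout).
LINKS = {
    "/": ["/tags", "/users"],
    "/tags": ["/tags/python", "/tags/java"],
    "/tags/python": ["/q/1", "/q/2"],
    "/tags/java": ["/q/3"],
    "/users": [],
    "/q/1": [], "/q/2": [], "/q/3": [],
}

def crawl(start, order="bfs"):
    """Return pages in the order a BFS or DFS spider would visit them."""
    frontier = deque([start])
    seen = {start}
    visited = []
    while frontier:
        # The only difference between BFS and DFS is which end of the
        # frontier the next page comes from: FIFO queue vs. LIFO stack.
        page = frontier.popleft() if order == "bfs" else frontier.pop()
        visited.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

With BFS the spider finishes every depth-1 page before touching depth 2; with DFS it drills straight down one tag's questions first, which is why DFS surfaces complete tag-questions-answers records sooner.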
2) how would you avoid getting into infinite loops?
This one is simpler. Common solutions include:
- Enforce a maximum link depth.
- Record the URLs you have already crawled and, before emitting a new request, check whether the URL has been visited.
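Both guards can be combined in a few lines. A sketch, where `get_links(url)` is a hypothetical stand-in for fetching a page and extracting its links:

```python
from collections import deque
from urllib.parse import urldefrag

MAX_DEPTH = 2  # hypothetical limit; tune per site

def crawl(start, get_links):
    """BFS crawl that avoids infinite loops via a visited set and a depth cap."""
    # Strip fragments so /page and /page#top count as the same URL.
    start, _ = urldefrag(start)
    visited = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= MAX_DEPTH:
            continue  # guard 1: cap the link depth
        for link in get_links(url):
            link, _ = urldefrag(link)
            if link not in visited:  # guard 2: skip already-seen URLs
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

Even on a cyclic graph such as `a → b → a` the crawl terminates, because `b` never re-enqueues the already-visited `a`.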
If I remember correctly, Scrapy already solves the second point for you. You could read its source code to look for a better solution.
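For reference, Scrapy exposes both guards without custom code: duplicate requests are filtered by default, and a depth cap is a one-line setting. A sketch of the relevant project settings (names from Scrapy's docs; worth double-checking against your version):

```python
# settings.py in a Scrapy project (configuration sketch)
DEPTH_LIMIT = 3  # drop requests more than 3 hops from the start URLs

# Duplicate requests are already filtered by the default RFPDupeFilter;
# pass dont_filter=True on an individual Request to bypass it deliberately.
```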