Question

I came across an interview question that I would like your opinion on. The questions are, when designing a web crawler:

1) what kind of pages will you hit with a DFS versus BFS?

2) how would you avoid getting into infinite loops?

I would appreciate it if somebody could answer them.


Solution

1) what kind of pages will you hit with a DFS versus BFS?

In most situations I would use BFS to implement a spider, because the most valuable information on a site usually sits at a shallow link depth; if it doesn't, the site is probably poorly designed and not worth crawling deeply.

If I want some specific data from one page plus related data that is a few hops away, and I want to see results soon after the spider starts, then I might choose DFS. Say I want to collect all the tags from Stack Overflow (the tag page is here), and for each tag I also want to know who answered which questions, and I want to verify early that the spider is running properly. Then DFS gives me a complete tag-questions-answers chain soon after the spider starts.

In short, it depends on the usage scenario.
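To make the difference concrete, here is a minimal sketch, assuming hypothetical fetch(url) and extract_links(body) helpers that are not part of the original answer; the only change between BFS and DFS is whether the frontier is treated as a queue or a stack:

```python
from collections import deque

def crawl(seed_url, fetch, extract_links, use_bfs=True, max_pages=100):
    """Crawl starting from seed_url.

    fetch(url) -> page body and extract_links(body) -> list of URLs are
    assumed helpers. BFS pops from the left of the frontier (queue),
    DFS pops from the right (stack); everything else is identical.
    """
    frontier = deque([seed_url])
    visited = {seed_url}
    pages = []

    while frontier and len(pages) < max_pages:
        url = frontier.popleft() if use_bfs else frontier.pop()
        body = fetch(url)
        pages.append((url, body))
        for link in extract_links(body):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return pages
```

With use_bfs=True the crawl sweeps each depth level before going deeper; with use_bfs=False it dives along one chain of links first, which matches the tag-questions-answers scenario above.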

2) how would you avoid getting into infinite loops?

This one is simpler. Common solutions include:

  • Enforce a maximum link depth.
  • Record the URLs you have already crawled and, before emitting a new request, check whether the URL has been seen (both ideas are combined in the sketch below).
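A sketch combining both safeguards, using the same assumed fetch and extract_links helpers as above; each queued URL carries its depth, and a URL is skipped if it has already been seen or sits beyond the depth cap:

```python
from collections import deque

def crawl_bounded(seed_url, fetch, extract_links, max_depth=3):
    """BFS crawl that avoids infinite loops with a seen-set and a depth cap."""
    seen = {seed_url}
    frontier = deque([(seed_url, 0)])   # (url, link depth from the seed)
    pages = []

    while frontier:
        url, depth = frontier.popleft()
        body = fetch(url)
        pages.append((url, body))
        if depth >= max_depth:
            continue                    # max link depth: stop following links here
        for link in extract_links(body):
            if link not in seen:        # never request the same URL twice
                seen.add(link)
                frontier.append((link, depth + 1))
    return pages
```

In practice you would also normalize URLs (strip fragments, resolve relative links) before adding them to the seen-set, otherwise the same page can slip through under different spellings.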

As I recall, Scrapy already handles the second point (it filters duplicate requests by default); you could read its source code for a more robust solution.
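For reference, a hedged sketch of the relevant Scrapy project settings as I remember them; check the documentation for the version you use:

```python
# settings.py of a Scrapy project (assumed setting names; verify against your Scrapy version)
DEPTH_LIMIT = 3                                         # cap the link depth
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"   # duplicate-request filter (the default)
```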

Licensed under: CC-BY-SA with attribution