Question

I came across an interview question that I would like your opinion on. The questions are, when designing a web crawler:

1) what kind of pages will you hit with a DFS versus BFS?

2) how would you avoid getting into infinite loops?

I would appreciate it if somebody could answer them.


Solution

1) what kind of pages will you hit with a DFS versus BFS?

In most situations I would use BFS to implement a spider, because the most valuable information on a site usually sits at a shallow link depth; if it doesn't, the site is probably poorly designed and not worth crawling deeply.

If I want some specific data from one page plus related data that is a few hops away, and I want to see results soon after the spider starts, then I might choose DFS. Say I want to collect all the tags from Stack Overflow (the tag page is here), and for each tag I also want to know who answered which questions, and I want to verify early that the spider is running properly. Then DFS gives me a complete tag-questions-answers chain soon after the spider starts.

In short, it depends on the usage scenario.
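To make the difference concrete, here is a minimal sketch, assuming hypothetical fetch(url) and extract_links(body) helpers that are not part of the original answer; the only change between BFS and DFS is whether the frontier is treated as a queue or a stack:

```python
from collections import deque

def crawl(seed_url, fetch, extract_links, use_bfs=True, max_pages=100):
    """Crawl starting from seed_url.

    fetch(url) -> page body and extract_links(body) -> list of URLs are
    assumed helpers. BFS pops from the left of the frontier (queue),
    DFS pops from the right (stack); everything else is identical.
    """
    frontier = deque([seed_url])
    visited = {seed_url}
    pages = []

    while frontier and len(pages) < max_pages:
        url = frontier.popleft() if use_bfs else frontier.pop()
        body = fetch(url)
        pages.append((url, body))
        for link in extract_links(body):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return pages
```

With use_bfs=True the crawl sweeps each depth level before going deeper; with use_bfs=False it dives along one chain of links first, which matches the tag-questions-answers scenario above.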

2) how would you avoid getting into infinite loops?

This one is simpler. Common solutions include:

  • Enforce a maximum link depth.
  • Record the URLs you have already crawled and, before emitting a new request, check whether the URL has been seen (both ideas are combined in the sketch below).
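A sketch combining both safeguards, using the same assumed fetch and extract_links helpers as above; each queued URL carries its depth, and a URL is skipped if it has already been seen or sits beyond the depth cap:

```python
from collections import deque

def crawl_bounded(seed_url, fetch, extract_links, max_depth=3):
    """BFS crawl that avoids infinite loops with a seen-set and a depth cap."""
    seen = {seed_url}
    frontier = deque([(seed_url, 0)])   # (url, link depth from the seed)
    pages = []

    while frontier:
        url, depth = frontier.popleft()
        body = fetch(url)
        pages.append((url, body))
        if depth >= max_depth:
            continue                    # max link depth: stop following links here
        for link in extract_links(body):
            if link not in seen:        # never request the same URL twice
                seen.add(link)
                frontier.append((link, depth + 1))
    return pages
```

In practice you would also normalize URLs (strip fragments, resolve relative links) before adding them to the seen-set, otherwise the same page can slip through under different spellings.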

As I recall, Scrapy already handles the second point (it filters duplicate requests by default); you could read its source code for a more robust solution.
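For reference, a hedged sketch of the relevant Scrapy project settings as I remember them; check the documentation for the version you use:

```python
# settings.py of a Scrapy project (assumed setting names; verify against your Scrapy version)
DEPTH_LIMIT = 3                                         # cap the link depth
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"   # duplicate-request filter (the default)
```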

Licensed under: CC-BY-SA with attribution