Measuring the depth of a website using Python [closed]

Question 1

(1) Make sure that your target website is static. A site like Amazon populates its pages from a database, and that database is driven by customer activity, so it changes every second. You might establish that the depth of a page containing "glove" is 7, and a minute later find that the depth is 3, because "scarf" is now on the first page and "glove" appears in its "People also bought this" list. Many factors can change your target website.

(2) If the concern above is not a problem, you need to build a crawler/spider to collect all the pages. (You might not need to store the raw HTML of every page, only records that look like this:)

currentURL  [links]
urlpage1 [urlpage2, urlpage3, ...]
urlpage2 [urlpage1, urlpage3, ...]
...
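As a rough illustration of how one such record could be produced, here is a minimal sketch using only the Python standard library. It assumes the page HTML has already been downloaded; `LinkCollector`, `page_record`, and the sample HTML are hypothetical names and data invented for this example, not part of any crawler framework.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect absolute link targets from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop "#fragment" parts.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute not in self.links:
                    self.links.append(absolute)


def page_record(url, html):
    """Return one (currentURL, [links]) record for an already-downloaded page."""
    collector = LinkCollector(url)
    collector.feed(html)
    return url, collector.links


# Hypothetical page content, used only to illustrate the record format.
sample_html = (
    '<a href="/page2">two</a> '
    '<a href="page3#top">three</a> '
    '<a href="/page2">again</a>'
)
record = page_record("http://example.com/page1", sample_html)
print(record)
```

A real crawler would also have to handle downloading, politeness rules, and restricting links to the target domain, which the tools listed in this answer take care of.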

Here are some tools to help you implement it.

Scrapy (Python)

Apache Nutch (shell/Java based)

(3) Assuming you have collected all the link relations between pages, you are one step away from calculating the shortest path length between the page you want and the main page. Now you need a tool to compute that depth. The mathematical model is similar to the one used in social network analysis. A graph database such as Neo4j, combined with a graph visualization tool such as Gephi, is well suited to this type of task, and in the end you will have a presentable result. You can also use graph packages in R to do it.
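For a small site, the depth calculation itself can also be sketched in plain Python, without a graph database. The example below is only a sketch under assumptions: `link_map` is a hypothetical in-memory version of the "currentURL [links]" records, and the page names are made up. It uses breadth-first search, which gives the minimum number of clicks from the main page to every reachable page.

```python
from collections import deque


def page_depths(link_map, start):
    """Breadth-first search: minimum number of clicks from start to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in link_map.get(page, []):
            if target not in depths:
                # First time we reach this page, so this is its shortest distance.
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical link records: page -> pages it links to.
link_map = {
    "home": ["page1", "page2"],
    "page1": ["home", "page3"],
    "page2": ["page1"],
    "page3": ["glove"],
    "glove": [],
}
depths = page_depths(link_map, "home")
# Pages that are absent from the result are unreachable from the start page.
print(depths.get("glove"))
```

Because links are directed, the depth from the main page to a product page can differ from the depth in the other direction, so the start page matters.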

This is actually a pretty interesting question, and it involves a whole range of programming skills. Good luck with your project; Stack Overflow will help you along the way.

Question 2

It sounds like you want to write a spider that does a breadth-first search from the first URL until it finds a link to the second URL.

I suggest you look at the Scrapy package; it makes it very easy to do.
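To make the idea concrete, here is a minimal sketch of that search in plain Python rather than Scrapy. It is an illustration under assumptions, not a Scrapy spider: `find_depth` and `get_links` are hypothetical names, and `get_links` stands in for whatever code downloads a page and returns its outgoing links. The `max_depth` limit is an added safeguard so the crawl cannot run forever on a large site.

```python
from collections import deque


def find_depth(start_url, target_url, get_links, max_depth=5):
    """Breadth-first crawl from start_url until a link to target_url is found.

    get_links(url) must return the list of URLs linked from that page.
    Returns the number of clicks needed, or None if the target is not
    found within max_depth clicks.
    """
    if start_url == target_url:
        return 0
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages beyond the depth limit
        for link in get_links(url):
            if link == target_url:
                return depth + 1
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return None


# Stand-in for real downloading: a hypothetical in-memory site.
fake_site = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": ["target"], "e": []}
print(find_depth("a", "target", lambda url: fake_site.get(url, [])))
```

Unlike a full crawl, this stops as soon as the target is seen, so it only downloads the pages that are closer to the start than the target is.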