Question

I am trying to find a way to extract the depth of a page within a website using Python. The depth of a subpage is the number of clicks a user needs to get from the main website (e.g. www.ualberta.ca) to the subpage (e.g. www.ualberta.ca/beartracks). For instance, if it takes one additional click to get to a subpage from the main domain, the depth of that subpage is 1.

Is there any way for me to measure this using Python? Thank you!


Solution

(1) You have to make sure that your target website is static. Take a website like Amazon: its pages are populated from a database, that database is driven by customer activity, and it changes every second. So you might measure the depth of a page containing "glove" as 7, and a minute later it turns out to be 3, because "scarf" is on the first page and "glove" is in its "People also bought this" list. Many factors can change the link structure of your target website.

(2) If the concern above is not a problem, you need to build a crawler/spider to collect all the pages. (It does not have to store all the raw HTML; records that look like this are enough:)

currentURL  [links]
urlpage1  [urlpage2, urlpage3, ...]
urlpage2  [urlpage1, urlpage3, ...]
...

Here are some tools to help you implement it.

Scrapy (Python)

Apache Nutch (shell/Java based)
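For a small site, the records described in step (2) can also be built with the Python standard library alone. The following is only a sketch of the link-extraction part, assuming the HTML of a page has already been downloaded; `LinkCollector` and `extract_links` are names made up for this example, and links to other hosts are dropped on the assumption that only the target site matters.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(page_url, html):
    """Return the sorted, de-duplicated same-host URLs linked from one page."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(page_url).netloc
    links = set()
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" parts so that
        # /page and /page#top count as the same page.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == host:
            links.add(absolute)
    return sorted(links)
```

Calling `extract_links(url, html)` for every crawled page and storing the result under `url` in a dict gives one "currentURL [links]" record per page.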

(3) Assuming you have collected all the link relations between pages, you need just one more step: calculating the shortest path length between the page you want and the main page. That length is the depth. The mathematical model here is similar to social network analysis, and a graph database such as

"Neo4j" plus a graph visualization tool such as

"Gephi"

will be perfect for this type of task. In the end you will also have a clean, presentable result. You can also use graph packages in R to do it.
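If the link records fit in memory, the shortest-path step does not strictly need a graph database. Here is a minimal sketch in Python that uses a plain breadth-first search instead of Neo4j; `page_depths`, `link_graph`, and the sample URLs are made up for illustration, and the graph is assumed to be a dict mapping each URL to the list of URLs it links to.

```python
from collections import deque


def page_depths(link_graph, start_url):
    """Return {url: minimum number of clicks from start_url}.

    Breadth-first search visits pages in order of increasing click
    count, so the first time a page is reached is via a shortest path.
    Pages that cannot be reached from start_url are not in the result.
    """
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        current = queue.popleft()
        for neighbor in link_graph.get(current, []):
            if neighbor not in depths:
                depths[neighbor] = depths[current] + 1
                queue.append(neighbor)
    return depths


# Hypothetical link records in the "currentURL [links]" shape.
example_graph = {
    "home": ["a", "b"],
    "a": ["c"],
    "b": ["c"],
    "c": ["home"],
}
example_depths = page_depths(example_graph, "home")
```

With this sample graph, page "c" is two clicks from "home", so `example_depths["c"]` is 2.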

This is actually a pretty interesting question, and it involves a whole series of different programming skills. So good luck with your project, and Stack Overflow will help you through.

Other tips

It sounds like you want to write a spider that does a breadth-first search from the first URL until it finds a link to the second URL.

I suggest you look at the Scrapy package; it makes it very easy to do.
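Independently of Scrapy, the breadth-first idea itself can be sketched in a few lines of Python. This is only an illustration: `find_depth` and `get_links` are hypothetical names, and `get_links` stands for whatever function downloads a page and returns the URLs it links to. The `max_depth` cut-off is an added assumption to keep the crawl from running forever on a large site.

```python
from collections import deque


def find_depth(start_url, target_url, get_links, max_depth=5):
    """Return the number of clicks from start_url to target_url.

    get_links(url) must return an iterable of URLs linked from url.
    Returns None if the target is not found within max_depth clicks.
    """
    if start_url == target_url:
        return 0
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for link in get_links(url):
            if link == target_url:
                return depth + 1
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return None


# A fake in-memory site standing in for real HTTP requests.
fake_site = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": [],
    "/c": ["/target"],
}
clicks = find_depth("/", "/target", lambda url: fake_site.get(url, []))
```

Because the queue is first-in, first-out, every page at depth 1 is examined before any page at depth 2, so the first hit on the target is also the shortest route to it.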

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow