Question

I'm trying to collect all links from a webpage using requests, BeautifulSoup4, and SoupStrainer in Python 3.3. I write my code in Komodo Edit 8.0 and also run my scripts from Komodo Edit. So far everything works fine, but on some webpages I get a popup with the following warning:

Warning: unresponsive script

A script on this page may be busy, or it may have stopped responding. You can stop the script
now, or you can continue to see if the script will complete.

Script: viewbufferbase:797

Then I can choose whether I want to continue or stop the script.

Here is a little code snippet:

import requests
from bs4 import BeautifulSoup, SoupStrainer

try:
    r = requests.get(address, headers=headers)
    # Parse only <a> tags that carry an href attribute
    soup = BeautifulSoup(r.text, "html.parser",
                         parse_only=SoupStrainer('a', href=True))
    for link in soup.find_all('a'):
        # some code
        pass
except requests.exceptions.RequestException as e:
    print(e)

My question is: what is causing this warning? Is it my Python script taking too long on a webpage, or is it a script on the webpage I'm scraping? I can't imagine it's the latter, because technically I'm not executing the page's scripts, right? Or could it be my poor internet connection?

Oh, and another small question: with the above code snippet, am I downloading pictures or just the plain HTML code? Sometimes when I look at my connection status, it seems like far too much data is coming in just for a plain HTML request. If so, how can I avoid downloading such content, and how can I avoid downloads with requests in general? Sometimes my program ends up on a download page.

Many thanks!


Solution

The issue might be either long loading times on a site, or a cycle in your website's link graph: page1 (the main page) links to page2 (Terms of Service), which in turn links back to page1. You could time each request to see how long it takes to get a response from a website.
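The timing suggestion can be sketched as follows (a minimal example; the `timed_get` helper and its parameters are my own naming, not from the original answer's snippet, and the syntax sticks to what Python 3.3 supports):

```python
import time

import requests


def timed_get(url, headers=None, timeout=10):
    """Fetch a URL and report how long the request took."""
    start = time.monotonic()
    response = requests.get(url, headers=headers, timeout=timeout)
    elapsed = time.monotonic() - start
    print("{0} responded in {1:.2f}s (status {2})".format(
        url, elapsed, response.status_code))
    return response


# Usage:
# r = timed_get("http://www.example.com")
```

A consistently slow response points at the site (or your connection); a fast response that still triggers the warning suggests your own loop, e.g. revisiting the same pages in a cycle.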

Regarding your last question:

I'm pretty sure requests doesn't parse your response's content (except for the .json() method). What you might be experiencing is a link to a resource, like <a href="http://www.example.com/very_big_file.exe">Free Cookies!</a>, which your script would then visit. requests has mechanisms to handle such cases; see its documentation on streaming requests for reference. The same technique also lets you check the Content-Type header to make sure you're downloading pages you're interested in.
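A sketch of that technique, using requests' stream=True option (the helper name fetch_html_only and the size limit are illustrative, not part of the answer): with stream=True, requests fetches only the status line and headers up front, so you can inspect Content-Type and bail out before the body is ever downloaded.

```python
import requests


def fetch_html_only(url, headers=None, max_bytes=2000000):
    """Return the page text, or None if the response is not HTML."""
    # stream=True defers downloading the body until .text or .content
    # is accessed, so we can look at the headers first.
    with requests.get(url, headers=headers, stream=True, timeout=10) as r:
        content_type = r.headers.get("Content-Type", "")
        if "text/html" not in content_type:
            return None  # skip .exe files, images, and other binaries
        if int(r.headers.get("Content-Length", 0)) > max_bytes:
            return None  # skip suspiciously large responses
        return r.text  # the body is actually downloaded here
```

Feeding the returned text (when it isn't None) into BeautifulSoup keeps the link collector from ever pulling a large binary over the wire.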

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow