Question

I've written code that extracts all the links from a specified URL; I took the idea from an online video tutorial. It worked with nytimes.com, but when I tried yell.com, an error was thrown: "Error: HTTP Error 416: Requested Range Not Satisfiable - http://www.yell.com/". What technique should I adopt to get around this?

import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

# url = "http://nytimes.com"
url = "http://www.yell.com/"

urls = [url]
visited = [url]

while len(urls) > 0:

    try:
        htmltext = urllib.request.urlopen(urls[0]).read()

        soup = BeautifulSoup(htmltext, "html.parser")

        urls.pop(0)
        print(len(urls))

        for tag in soup.findAll('a', href=True):
            tag['href'] = urllib.parse.urljoin(url, tag['href'])
            if url in tag['href'] and tag['href'] not in visited:
                urls.append(tag['href'])
                visited.append(tag['href'])

    except urllib.error.HTTPError as e:
        print("Error: " + str(e) + " - " + urls[0])
        urls.pop(0)  # drop the failing URL so the loop doesn't retry it forever

print(visited)

Solution

What is happening here is that yell.com is detecting irregular activity and blocking the request. One way around it is to do the scraping through a real browser with Selenium, which also executes the page's JavaScript:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

url = "http://www.yell.com/"

driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # wait for the page to load

# At this point, the Firefox window that opened will show the blocking message.

# If you manage to get past that block, you can load BeautifulSoup this way:
soup = BeautifulSoup(driver.page_source, "html.parser")
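
If driving a full browser is too heavy, another technique worth trying is to send a browser-like `User-Agent` header with plain urllib: some sites reject the default `Python-urllib/x.y` agent with errors such as 416. This is a general sketch, not guaranteed to work for yell.com specifically, and the User-Agent string is just an example:

```python
import urllib.request

def fetch(url, user_agent="Mozilla/5.0 (X11; Linux x86_64)"):
    """Fetch a page while sending a browser-like User-Agent header.

    Some sites reject urllib's default 'Python-urllib/x.y' agent
    with errors such as HTTP 416; a browser-like header often helps.
    """
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage (network call, so commented out here):
# htmltext = fetch("http://www.yell.com/")
```

You can then feed the returned HTML to BeautifulSoup exactly as in the original loop.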
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow