Question

I'm working on a small project, a site scraper, and I've run into a problem that I think is with urllib.urlopen(). Say I want to scrape Google's homepage, then a concatenated query, and then a search query. (I'm not actually trying to scrape Google, but I figured it would be easy to demonstrate on.)

from bs4 import BeautifulSoup
import urllib

url = urllib.urlopen("https://www.google.com/")

soup = BeautifulSoup(url)

parseList1=[]

for i in soup.stripped_strings:
    parseList1.append(i)

parseList1 = list(parseList1[10:15])

#Second URL

url2 = urllib.urlopen("https://www.google.com/"+"#q=Kerbal Space Program")

soup2 = BeautifulSoup(url2)

parseList2=[]

for i in soup2.stripped_strings:
    parseList2.append(i)

parseList2 = list(parseList2[10:15])

#Third URL

url3 = urllib.urlopen("https://www.google.com/#q=Kerbal Space Program")

soup3 = BeautifulSoup(url3)

parseList3=[]

for i in soup3.stripped_strings:
    parseList3.append(i)

parseList3 = list(parseList3[10:15])

print " 1 "

for i in parseList1:
    print i

print " 2 "

for i in parseList2:
    print i

print " 3 "

for i in parseList3:
    print i

This prints out:

1

A whole nasty mess of scraped code from Google

2

3

Which leads me to believe that the # symbol might be preventing the url from opening? The concatenated string doesn't throw any errors for concatenation, yet still doesn't read anything in.

Does anyone have any idea on why that would happen? I never thought that a # inside a string would have any effect on the code. I figured this would be some silly error on my part, but if it is, I can't see it.

Thanks


Solution

Browsers do not send the URL fragment (the part that begins with "#") to the server.

RFC 1808 (Relative Uniform Resource Locators) : Note that the fragment identifier (and the "#" that precedes it) is not considered part of the URL. However, since it is commonly used within the same string context as a URL, a parser must be able to recognize the fragment when it is present and set it aside as part of the parsing process.

You get the right result in a browser because the browser first sends a request to https://www.google.com, JavaScript on the page then detects the URL fragment (similar to how spell checking works here; most websites won't do this), the browser sends a new AJAX request (https://www.google.com?q=xxxxx), and finally the page is rendered with the JSON data that comes back. urllib cannot execute JavaScript for you.
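To see how the fragment is set aside during parsing, here is a minimal sketch. It assumes Python 3, where the parsing helpers live in urllib.parse; the question's code is Python 2, where the same functions are in the urlparse module. The variable names are only illustrative.

```python
from urllib.parse import urlsplit, urldefrag

# The third URL from the question.
url = "https://www.google.com/#q=Kerbal Space Program"

parts = urlsplit(url)
print(repr(parts.path))      # the path is just "/"
print(repr(parts.query))     # the query string is empty
print(repr(parts.fragment))  # everything after "#" lands in the fragment

# urldefrag() returns the URL without its fragment, which is
# effectively what a client requests from the server.
request_url, fragment = urldefrag(url)
print(request_url)
```

Because the query component is empty, the server only ever sees a request for the homepage, and the search term never leaves the client.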

To fix your problem, replace https://www.google.com/#q=Kerbal Space Program with https://www.google.com/?q=Kerbal Space Program so that the search term is sent to the server as a query string instead of being dropped as a fragment.
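Since the search term contains spaces, it is safer to let the library encode it than to concatenate the raw string. The sketch below assumes Python 3 (urllib.parse.urlencode; in Python 2 the function is urllib.urlencode). The helper name build_query_url is hypothetical and not part of any library.

```python
from urllib.parse import urlencode


def build_query_url(base, term):
    """Return base with the search term attached as an encoded ?q= query."""
    # urlencode() escapes spaces and other reserved characters.
    return base + "?" + urlencode({"q": term})


url = build_query_url("https://www.google.com/", "Kerbal Space Program")
print(url)  # https://www.google.com/?q=Kerbal+Space+Program
```

The resulting string can then be passed to urlopen() in place of the "#q=" version. Whether the server returns useful results for that URL still depends on the site.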

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow