Question

I have a strange problem parsing the Herald Sun webpage to get its list of RSS feeds. When I look at the page in a browser, I can see the links with titles. However, when I use Python and Beautiful Soup to parse the page, the response does not even contain the section I want to parse.

import urllib.request
import urllib.error

hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.71 (KHTML, like Gecko) Version/7.0 Safari/537.71',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib.request.Request("http://www.heraldsun.com.au/help/rss", headers=hdr)

try:
    page = urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    print(e.fp.read())

html_doc = page.read()

with open("Temp/original.html", 'w') as f:
    f.write(html_doc.decode('utf-8'))

As you can check, the written file does not contain the expected section, so Beautiful Soup obviously has nothing to work with here.

I wonder how the webpage enables this protection and how to overcome it. Thanks.


Solution

For commercial use, read the terms of service first.

There is really not that much information the server knows about who is making the request: the IP address, the User-Agent, and the Cookie headers. Also, urllib2 will not fetch content that is generated by JavaScript.

JavaScript or Not?

(1) Open the Chrome developer tools, disable the cache and JavaScript, and check whether you can still see the information you want. If you cannot, you have to use a tool that supports JavaScript, such as Selenium or PhantomJS.

However, in this case, the website does not look that sophisticated.
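
If the content had turned out to be JavaScript-generated, a fallback would be to let a real browser render the page and then hand the resulting HTML to Beautiful Soup. Here is a minimal illustrative sketch, not part of the original answer; it assumes Selenium and a matching browser driver are installed locally:

# Sketch: fetch the fully rendered page with Selenium, then parse it as usual.
from selenium import webdriver
import bs4

driver = webdriver.Firefox()  # or webdriver.Chrome(); PhantomJS in older setups
driver.get("http://www.heraldsun.com.au/help/rss")
html = driver.page_source     # HTML after JavaScript has run
driver.quit()

soup = bs4.BeautifulSoup(html, "html.parser")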

User-Agent? Cookie?

(2) Then the problem comes down to tuning the User-Agent or the Cookie header. As you have already tried, the User-Agent alone seems not to be enough, so it must be the cookie that does the trick.


As you can see in the network panel, the first page request actually returns "temporarily unavailable", and you need to click the rss HTML request that comes back with a 200 status code. Just copy the User-Agent and Cookie headers from that request and it will work.


Here is how to add the cookie using urllib2:

import urllib2, bs4

opener = urllib2.build_opener()
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36")]
# I omitted the cookie here; you need to copy and paste your own
opener.addheaders.append(('Cookie', 'act-bg-i...eat_uuniq=1; criteo=; pl=true'))
soup = bs4.BeautifulSoup(opener.open("http://www.heraldsun.com.au/help/rss"))

# Narrow down to the block that holds the feed links
div = soup.find('div', {"id": "content-2"}).find('div', {"class": "group-content"})

for a in div.find_all('a'):
    try:
        if 'feeds.news' in a['href']:
            print a
    except KeyError:
        # Some <a> tags have no href attribute
        pass

And here is the output:

<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_breakingnews_2800.xml">Breaking News</a>
<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_topstories_2803.xml">Top Stories</a>
<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_worldnews_2793.xml">World News</a>
<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_morenews_2794.xml">Victoria and National News</a>
<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_sport_2789.xml">Sport News</a>
...
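
Since the question uses Python 3, here is a sketch of the same approach with urllib.request (an assumption-laden translation, not part of the original answer; the Cookie value is still the placeholder from above and must be replaced with your own):

# Python 3 sketch of the same cookie-based request.
import urllib.request
import bs4

opener = urllib.request.build_opener()
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'),
    ('Cookie', 'act-bg-i...eat_uuniq=1; criteo=; pl=true'),  # placeholder, paste your own
]
soup = bs4.BeautifulSoup(opener.open("http://www.heraldsun.com.au/help/rss"), "html.parser")

div = soup.find('div', {"id": "content-2"}).find('div', {"class": "group-content"})
for a in div.find_all('a'):
    if 'feeds.news' in a.get('href', ''):
        print(a)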

OTHER TIPS

The site could very well be serving different content depending on the User-Agent string in the headers. Websites often do this for mobile browsers, for example.

Since you're not specifying one, urllib is going to use its default:

By default, the URLopener class sends a User-Agent header of urllib/VVV, where VVV is the urllib version number.

You could try spoofing a common User-Agent string by following the advice in this question. See What's My User Agent?
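
To confirm what your script is actually sending, you can echo your request headers back with a header-echo service such as httpbin.org (a quick illustrative check, not part of the original answer):

# Ask httpbin.org to echo the headers we send, returned as JSON.
import urllib.request

req = urllib.request.Request(
    "https://httpbin.org/headers",
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64)'}
)
print(urllib.request.urlopen(req).read().decode('utf-8'))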

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow