For commercial use, read the site's terms of service first.
The server does not know much about who is making a request: essentially the IP address, the User-Agent, and the cookies. Separately, urllib2 only fetches the raw response, so it will not see content that is generated by JavaScript.
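To illustrate why a script is easy to tell apart from a browser, here is a small sketch of what an opener announces by default. It uses Python 3's `urllib.request` (the successor of `urllib2`), which is an assumption about your environment; the same `addheaders` attribute exists on the `urllib2` opener.

```python
import urllib.request

# A freshly built opener carries exactly one default header: a User-Agent
# that identifies the client as Python's urllib rather than a browser.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# Print the header name and value that would be sent if nothing is overridden
for name, value in default_headers.items():
    print(name, "=", value)
```

No cookie is sent at all unless you add one yourself, which is the other difference from a browser session.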
JavaScript or Not?
(1) Open the Chrome developer tools, disable the cache and JavaScript, and reload the page to check that you can still see the information you want. If you cannot, you need a tool that executes JavaScript, such as Selenium or PhantomJS.
In this case, however, the website does not look that sophisticated.
User-Agent? Cookie?
(2) The problem then comes down to tuning the User-Agent or the cookies. As you have already found, the User-Agent alone does not seem to be enough, so the cookie is what does the trick.
As you can see, the first page request actually returns "temporarily unavailable"; you need to click the rss HTML entry that has a 200 return code. Copy the User-Agent and cookies from that request and it will work.
Here is the code showing how to add a cookie using urllib2:
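The same check can be approximated in code: look for the text you want in the raw HTML while ignoring anything inside `<script>` or `<style>`. The sketch below uses only Python 3's standard `html.parser`. The class name, function name and both sample pages are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text present in the raw HTML, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def in_static_html(html, marker):
    """Return True if `marker` appears in the markup itself, outside scripts."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return marker in " ".join(parser.chunks)


# Hypothetical sample pages: one static, one that renders its text via JavaScript
static_page = '<div><a href="/feed.xml">Breaking News</a></div>'
scripted_page = '<div id="app"></div><script>render("Breaking News")</script>'

print(in_static_html(static_page, "Breaking News"))
print(in_static_html(scripted_page, "Breaking News"))
```

If the marker is only found inside a script, a plain HTTP fetch will probably not be enough and a JavaScript-capable tool is the safer choice.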
import urllib2, bs4
opener = urllib2.build_opener()
opener.addheaders = [("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36")]
# The cookie below is truncated; copy and paste your own from the browser
opener.addheaders.append(('Cookie', 'act-bg-i...eat_uuniq=1; criteo=; pl=true'))
soup = bs4.BeautifulSoup(opener.open("http://www.heraldsun.com.au/help/rss"))
div = soup.find('div', {"id":"content-2"}).find('div', {"class":"group-content"})
for a in div.find_all('a'):
    # Some anchors may have no href; .get() avoids a KeyError for those
    if 'feeds.news' in a.get('href', ''):
        print a
And here is the output:
<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_breakingnews_2800.xml">Breaking News</a>
<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_topstories_2803.xml">Top Stories</a>
<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_worldnews_2793.xml">World News</a>
<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_morenews_2794.xml">Victoria and National News</a>
<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_sport_2789.xml">Sport News</a>
...
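For readers on Python 3, where `urllib2` was merged into `urllib.request`, the request setup could be sketched roughly as follows. The `build_request` helper and the `COOKIE` placeholder are hypothetical; the real cookie value must still be copied from your own browser session. The sketch only builds and inspects the request and does not contact the site.

```python
import urllib.request

URL = "http://www.heraldsun.com.au/help/rss"
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36")
# Placeholder only: paste the Cookie header copied from your own browser
COOKIE = "name=value"


def build_request(url, user_agent, cookie):
    """Return a Request carrying browser-like User-Agent and Cookie headers."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent, "Cookie": cookie}
    )


req = build_request(URL, USER_AGENT, COOKIE)

# Inspect the headers that would be sent, without touching the network
print(req.header_items())

# To actually fetch the page, something like this should work:
# html = urllib.request.urlopen(req).read()
```

The resulting HTML can then be passed to BeautifulSoup in the same way as in the urllib2 version above.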