Web scraping using urllib2

https://stackoverflow.com/questions/20598818

02-09-2022

Question

I am trying to scrape all the titles off of this RSS Feed:

http://www.quora.com/Python-programming-language-1/rss

This is my code for the same:

import urllib2
import re
content = urllib2.urlopen('http://www.quora.com/Python-programming-language-1/rss').read()
allTitles =  re.compile('<title>(.*)</title>')
list = re.findall(allTitles,content)
for e in range(0, 2):
    print list[e]

However, instead of getting a list of titles as the output, I am getting a bunch of code from the rss source. What am I doing wrong?

Solution

You should make the quantifier non-greedy by adding ? after .* in the expression:

#allTitles =  re.compile('<title>(.*)</title>')
allTitles =  re.compile('<title>(.*?)</title>')

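Without the ?, .* is greedy: it matches as much as it can, so the (.*) group captures everything from the first <title> up to the last </title> on the same line, markup included.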
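
To see the difference in isolation, here is a minimal sketch. The sample string is a made-up stand-in for one line of the feed, not the real Quora content, and the variable names are only for illustration:

```python
import re

# Hypothetical one-line sample with two <title> elements (not the real feed).
sample = '<title>First</title><link>x</link><title>Second</title>'

# Greedy: .* runs to the end of the line, then backs up to the LAST </title>.
greedy = re.findall('<title>(.*)</title>', sample)

# Non-greedy: .*? stops at the FIRST </title> it can reach.
lazy = re.findall('<title>(.*?)</title>', sample)

print(greedy)  # one long match that still contains markup
print(lazy)    # one entry per title
```

With the greedy pattern you get a single match containing everything between the first <title> and the last </title>; with the non-greedy one you get 'First' and 'Second' as separate entries.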

OTHER TIPS

As already mentioned, your regular expression is missing the non-greedy specifier, and adding it fixes the code. But I strongly recommend switching from regular expressions to tools better suited to XML parsing, such as lxml or BeautifulSoup, or to a specialised RSS parsing module such as feedparser.

For example, see how your task can be done with lxml:

>>> import lxml.etree
>>> rss = lxml.etree.fromstring(content)
>>> titles = rss.findall('.//title')
>>> print '\n'.join(title.text for title in titles[:2])
Questions About Python (programming language) on Quora
Could someone explain for me the following Python function that uses @wraps from functools?
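
If installing lxml is not an option, roughly the same thing can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in alternative, not what the answer above uses. The feed text below is a made-up miniature example, and feed_titles is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature feed standing in for the downloaded content.
sample_feed = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First question</title></item>
<item><title>Second question</title></item>
</channel></rss>"""


def feed_titles(xml_text):
    """Return the text of every <title> element, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter('title')]


titles = feed_titles(sample_feed)
print('\n'.join(titles[:2]))
```

As in the lxml version, the first title is the feed's own title and the rest are the item titles. For a real feed you would pass in the content returned by urlopen(...).read() instead of sample_feed.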

Licensed under: CC-BY-SA with attribution
