Question

I want to fetch the title of a webpage that I open using urllib2. What is the best way to parse the HTML and find what I need (for now only the <title> tag, but I might need more in the future)?

Is there a good parsing library for this purpose?

Solution

Yes, I would recommend BeautifulSoup.

If you're getting the title it's simply:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)
myTitle = soup.html.head.title

or, to get a list of every <title> tag in the document:

myTitle = soup('title')

Taken from the documentation

It is tolerant of malformed markup and will usually build a usable tree even from messy HTML.

OTHER TIPS

Try Beautiful Soup:

import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://www.example.com'
response = urllib2.urlopen(url)
html = response.read()

soup = BeautifulSoup(html)
title = soup.html.head.title
print title.contents  # .contents is a list of the tag's children

Why are you guys importing a whole extra library for one task? No regular expressions? Wasn't the request for urllib, not bs4 or mech, which are third-party? To do it with the standard library, read the HTML, match the string, then split on '>' and '<' with re or whatever.

# Scan the HTML line by line for the first line containing the <title> tag.
Title = None
for line in html.splitlines():
    if '<title>' in line:
        Title = line
        break

That's Python 2, I think. You can then strip the tags from the matched line.
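For what it's worth, here is a rough sketch of that standard-library route using re. The names extract_title, TITLE_RE and sample are made up for illustration, and it assumes html is already a text string (on Python 3, response.read() returns bytes that would need decoding first). A regex like this is fragile on unusual markup, so treat it as a quick hack rather than a real parser.

```python
import re

# Matches the first <title>...</title> pair; tolerates attributes, mixed case,
# and a title that spans several lines.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(match.group(1).split())


# Hypothetical sample document, used instead of a live urlopen() call.
sample = "<html><head><TITLE>\n  Example   Domain\n</TITLE></head><body></body></html>"
print(extract_title(sample))  # prints: Example Domain
```

It only handles the title, so if you need other tags later, a proper parser will probably save you trouble.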

Use Beautiful Soup.

import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen("...").read()
soup = BeautifulSoup(html)
print soup.title.string
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow