Python Regular Expressions Excluding Tags

https://stackoverflow.com/questions/23434868

14-07-2023

Question

I have written the script below, which goes to a plain-text dictionary website, searches for the entered word, and retrieves its definition. The only problem is that the result comes back with the closing paragraph tag as well. I have been messing around with this for ages.

#!/usr/bin/python
import urllib2
import re
import sys


word = 'Xylophone'
page = urllib2.urlopen('http://www.mso.anu.edu.au/~ralph/OPTED/v003/wb1913_'+word[0].lower()+'.html')
html = page.read()

match = re.search(r'<P><B>'+word+'</B>.............(.*)', html)

if match: 
    print match.group(1)
else: print 'not found'

This returns the definition with the tags included. What's the correct regex syntax to ignore the tags?

Solution

Prerequisite: read the famous topic "RegEx match open tags except XHTML self-contained tags".
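To see where the stray tag in the question comes from, here is a small sketch run against a made-up line of HTML. The sample string is an assumption about the page's markup, inferred from the question's pattern rather than copied from the site. The greedy (.*) runs to the end of the line, so it swallows the closing </P>. A pattern that stops at </P> avoids that, but only for as long as the markup keeps exactly this shape.

```python
import re

# Made-up sample line; the real page's markup is assumed to look similar.
sample = '<P><B>Xylophone</B> (<I>n.</I>) An instrument struck with small hammers.</P>'
word = 'Xylophone'

# The question's pattern: thirteen dots skip " (<I>n.</I>) ", then the greedy
# (.*) captures everything up to the end of the line, closing tag included.
greedy = re.search(r'<P><B>' + re.escape(word) + r'</B>.............(.*)', sample)
greedy_text = greedy.group(1)

# Stopping the capture at </P> drops the tag, but this still depends on the
# exact layout of the line and breaks as soon as the markup changes.
anchored = re.search(r'<P><B>' + re.escape(word) + r'</B>.*?\)\s*(.*?)</P>', sample)
anchored_text = anchored.group(1)

print(greedy_text)
print(anchored_text)
```

The second pattern is shown only to explain the behaviour; it is fragile in exactly the way the linked topic warns about.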

Since you are parsing an HTML page, I'd use a tool made for the job: an HTML parser.

For example, BeautifulSoup:

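import urllib2
from bs4 import BeautifulSoup

word = 'Xylophone'
# The dictionary is split into one page per initial letter
page = urllib2.urlopen('http://www.mso.anu.edu.au/~ralph/OPTED/v003/wb1913_'+word[0].lower()+'.html')
soup = BeautifulSoup(page)

# Find the <b> tag holding the headword, then take the text of its parent
# paragraph; .text returns the content with all tags stripped
print soup.find('b', text=word).parent.text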

prints:

Xylophone (n.) An instrument common among the Russians, Poles, and Tartars, consisting of a series of strips of wood or glass graduated in length to the musical scale, resting on belts of straw, and struck with two small hammers. Called in Germany strohfiedel, or straw fiddle.
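
The snippet above is Python 2 and needs BeautifulSoup installed. If neither is an option, roughly the same lookup can be sketched on Python 3 with only the standard library's html.parser module. This is a different technique from the BeautifulSoup call above, not a port of it. The DefinitionFinder class, the find_definition helper, and the sample markup are all hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class DefinitionFinder(HTMLParser):
    """Collect the text of the first <p> that contains <b>word</b>."""

    def __init__(self, word):
        super().__init__()
        self.word = word
        self.in_bold = False   # currently inside a <b> element
        self.found = False     # headword seen in the current paragraph
        self.done = False      # target paragraph has been read completely
        self.parts = []        # text pieces of the current paragraph

    def _end_paragraph(self):
        if self.done:
            return
        if self.found:
            self.done = True   # keep the collected text as it is
        else:
            self.parts = []    # not the paragraph we want, start over

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <P> and <B> are handled too.
        if tag == 'b':
            self.in_bold = True
        elif tag == 'p':
            self._end_paragraph()   # also covers an unclosed previous <p>

    def handle_endtag(self, tag):
        if tag == 'b':
            self.in_bold = False
        elif tag == 'p':
            self._end_paragraph()

    def handle_data(self, data):
        if self.done:
            return
        if self.in_bold and data.strip() == self.word:
            self.found = True
        self.parts.append(data)


def find_definition(html, word):
    """Return the tag-free text of the word's paragraph, or None."""
    finder = DefinitionFinder(word)
    finder.feed(html)
    finder.close()
    return ''.join(finder.parts).strip() if finder.found else None


# Made-up sample markup; the real page is assumed to be shaped like this.
sample = ('<P><B>Xylography</B> (<I>n.</I>) The art of engraving on wood.</P>\n'
          '<P><B>Xylophone</B> (<I>n.</I>) An instrument struck with small hammers.</P>\n')

print(find_definition(sample, 'Xylophone'))
```

For the real page, the html argument would be the decoded text of the response from urllib.request.urlopen (the Python 3 replacement for urllib2.urlopen). Which encoding the site uses is not stated here, so that part is left out of the sketch.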

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow