Question

I wrote some code that uses mechanize to extract information from a website for a given search term. The result contains HTML tags and other markup along with the text, but I need only the text. How can I modify the code to extract just the text?

import mechanize
br=mechanize.Browser()
br.set_handle_robots( False )
br.addheaders = [('User-agent', 'Firefox')]
r=br.open("http://www.drugs.com/search-wildcard-phonetic.html")
br.select_form(nr=0)
br.form['searchterm']='panadol'
br.submit()
print br.response().read()

Solution

This appears to be the same question as "Python code to remove HTML tags from a string", which in turn points to "Strip HTML from strings in Python".

Copying the top answer from that question gives:

I always used this function to strip HTML tags, as it requires only the Python stdlib:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
  def __init__(self):
    self.reset()
    self.fed = []
  def handle_data(self, d):
    self.fed.append(d)
  def get_data(self):
    return ''.join(self.fed)

def strip_tags(html):
  s = MLStripper()
  s.feed(html)
  return s.get_data()
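
The function above targets Python 2 (the `HTMLParser` module). A minimal sketch of the same idea for Python 3 follows, assuming the page has already been decoded to a `str`. The class name `TextExtractor` and the choice to skip the contents of `<script>` and `<style>` are illustrative additions, not part of the quoted answer.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # Python 3 requires the base initializer; convert_charrefs=True
        # turns entities such as &amp; into their characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Panadol &amp; <b>paracetamol</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(strip_tags(sample))
```

With the code from the question, the last line would pass the page body to `strip_tags` instead of printing it raw. Under Python 3 the body read from the response is likely to be bytes, so it would need decoding first, for example with `.decode("utf-8", errors="replace")`; the UTF-8 encoding here is an assumption about the site.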
Licensed under: CC-BY-SA with attribution