Question

I'm trying to extract text from arbitrary html pages. Some of the pages (which I have no control over) have malformed html or scripts which make this difficult. Also I'm on a shared hosting environment, so I can install any python lib, but I can't just install anything I want on the server.

pyparsing and html2text.py also did not seem to work for malformed html pages.

Example URL is http://apnews.myway.com/article/20091015/D9BB7CGG1.html

My current implementation is approximately the following:

# Try using BeautifulSoup 3.0.7a
soup = BeautifulSoup.BeautifulSoup(s) 
comments = soup.findAll(text=lambda text:isinstance(text,Comment))
[comment.extract() for comment in comments]
c=soup.findAll('script')
for i in c:
    i.extract()    
body = bsoup.body(text=True)
text = ''.join(body) 
# if BeautifulSoup  can't handle it, 
# alter html by trying to find 1st instance of  "<body" and replace everything prior to that, with "<html><head></head>"
# try beautifulsoup again with new html 

if beautifulsoup still does not work, then I resort to using a heuristic of looking at the 1st char, last char (to see if they looks like its a code line # < ; and taking a sample of the line and then check if the tokens are english words, or numbers. If to few of the tokens are words or numbers, then I guess that the line is code.

I could use machine learning to inspect each line, but that seems a little expensive and I would probably have to train it (since I don't know that much about unsupervised learning machines), and of course write it as well.

Any advice, tools, strategies would be most welcome. Also I realize that the latter part of that is rather messy since if I get a line that is determine to contain code, I currently throw away the entire line, even if there is some small amount of actual English text in the line.

Was it helpful?

Solution

Try not to laugh, but:

class TextFormatter:
    def __init__(self,lynx='/usr/bin/lynx'):
        self.lynx = lynx

    def html2text(self, unicode_html_source):
        "Expects unicode; returns unicode"
        return Popen([self.lynx, 
                      '-assume-charset=UTF-8', 
                      '-display-charset=UTF-8', 
                      '-dump', 
                      '-stdin'], 
                      stdin=PIPE, 
                      stdout=PIPE).communicate(input=unicode_html_source.encode('utf-8'))[0].decode('utf-8')

I hope you've got lynx!

OTHER TIPS

Well, it depends how good the solution has to be. I had a similar problem, importing hundreds of old html pages into a new website. I basically did

# remove all that crap around the body and let BS fix the tags
newhtml = "<html><body>%s</body></html>" % (
    u''.join( unicode( tag ) for tag in BeautifulSoup( oldhtml ).body.contents ))
# use html2text to turn it into text
text = html2text( newhtml )

and it worked out, but of course the documents could be so bad that even BS can't salvage much.

BeautifulSoup will do bad with malformed HTML. What about some regex-fu?

>>> import re
>>> 
>>> html = """<p>This is paragraph with a bunch of lines
... from a news story.</p>"""
>>> 
>>> pattern = re.compile('(?<=p>).+(?=</p)', re.DOTALL)
>>> pattern.search(html).group()
'This is paragraph with a bunch of lines\nfrom a news story.'

You can then assembly a list of valid tags from which you want to extract information.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top