Extracting readable text from HTML using Python?

https://stackoverflow.com/questions/3172343

02-10-2019

Question

I know about utilities such as html2text and BeautifulSoup, but they also extract the JavaScript on the page and mix it into the text, which makes the two hard to separate.

htmlDom = BeautifulSoup(webPage)

htmlDom.findAll(text=True)

Alternately,

from stripogram import html2text
extract = html2text(webPage)

Both of these also extract all the JavaScript on the page, which is not what I want.

I just want to extract the readable text, i.e. the text you could copy from your browser.

Solution

If you want to avoid extracting any of the contents of script tags with BeautifulSoup,

nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)

will do that for you: it gets the root's immediate children that are not script tags (a separate htmlDom.findAll(recursive=False, text=True) gets the strings that are immediate children of the root). You need to do this recursively, e.g., as a generator:

def nonScript(tag):
    return tag.name != 'script'

def getStrings(root):
    for s in root.childGenerator():
        if hasattr(s, 'name'):      # then it's a tag
            if s.name == 'script':  # skip it!
                continue
            for x in getStrings(s):
                yield x
        else:                       # it's a string!
            yield s

I'm using childGenerator (in lieu of findAll) so that I can just get all the children in order and do my own filtering.
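As a rough illustration of the same recursive walk, here is a minimal sketch written against the newer bs4 package (BeautifulSoup 4) rather than the BeautifulSoup 3 API used above. The names get_strings, SKIPPED_TAGS and sample_html are made up for this example, and skipping style tags and comments as well is an extra assumption, not part of the answer above.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Assumption: CSS is unwanted too, so <style> is skipped along with <script>.
SKIPPED_TAGS = {"script", "style"}


def get_strings(root):
    """Yield the text nodes under root in document order, skipping SKIPPED_TAGS."""
    for child in root.children:
        if isinstance(child, Tag):
            if child.name in SKIPPED_TAGS:
                continue  # do not descend into <script> or <style>
            yield from get_strings(child)
        elif isinstance(child, NavigableString) and not isinstance(child, Comment):
            yield str(child)  # a plain text node (HTML comments are ignored)


# Hypothetical sample page, used only for this demonstration.
sample_html = (
    "<html><head><title>Demo</title>"
    "<script>var hidden = 1;</script></head>"
    "<body><p>Hello <b>world</b></p>"
    "<script>alert('x');</script></body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")
text = " ".join(s.strip() for s in get_strings(soup) if s.strip())
print(text)
```

Walking the children by hand, as opposed to calling findAll, keeps the strings in document order and makes it easy to add more tag names to the skip set later.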

OTHER TIPS

Using BeautifulSoup, something along these lines:

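def _extract_text(t):
    if not t:
        return ""
    if isinstance(t, str):  # text node; on Python 2, test against basestring instead
        # collapse newlines and runs of spaces into single spaces
        return " ".join(filter(None, t.replace("\n", " ").split(" ")))
    if t.name.lower() == "br": return "\n"      # line break
    if t.name.lower() == "script": return "\n"  # drop script contents
    return "".join(extract_text(c) for c in t)

def extract_text(t):
    # strip leading and trailing whitespace from every line
    return '\n'.join(x.strip() for x in _extract_text(t).split('\n'))

print(extract_text(htmlDom))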

You can also remove the script tags in BeautifulSoup, with something like:

for script in soup("script"):
    script.extract()

See "Removing Elements" in the BeautifulSoup documentation.
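To show how the removal step might fit into a complete extraction, here is a small sketch assuming the bs4 package (BeautifulSoup 4) and Python's built-in html.parser. The sample_html string and the variable names are invented for illustration, and removing style tags along with script tags is an added assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for this demonstration.
sample_html = """
<html>
  <head><script>var tracking = true;</script><style>p { color: red; }</style></head>
  <body>
    <p>First paragraph.</p>
    <script>document.write("injected");</script>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Detach every <script> and <style> element from the tree.
for tag in soup(["script", "style"]):
    tag.extract()

# get_text() concatenates the remaining strings; strip each line and drop blank ones.
lines = [line.strip() for line in soup.get_text().splitlines()]
visible_text = "\n".join(line for line in lines if line)
print(visible_text)
```

Note that this modifies the parsed tree in place, so parse the page again if the script elements are needed later.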

Try it out:

http://code.google.com/p/boilerpipe/

http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/

Licensed under: CC-BY-SA with attribution
