Question

I'm experimenting with RoboBrowser (http://robobrowser.readthedocs.org/en/latest/readme.html), a new Python library based on the Beautiful Soup library. With some help, I have returned an HTML page within a Django app, but I can't figure out how to strip the tags to give me just the text. My Django app contains:

def index(request):
    from django.utils.html import strip_tags
    p = str(request.POST.get('p', False))  # p='https://www.yahoo.com/'
    browser = RoboBrowser(history=True)
    browser.open(p)
    html = browser.response
    stripped = strip_tags(html)
    return HttpResponse(stripped)

When I look at the output, I see that it is the same as the original HTML. Also, I don't think RoboBrowser has the text() method of Beautiful Soup.

I also tried the following (from "Python code to remove HTML tags from a string"):

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

Same result! How can I remove the HTML tags and return only the text?

Solution

Beautiful Soup provides the get_text() method for extracting text from a parsed HTML document (somewhat confusingly, this is equivalent to the getText method and the text property). You can access the parsed HTML of the current page using browser.parsed. So, to get the plain text of the current page, try:

text = browser.parsed.get_text()
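
To try get_text() without RoboBrowser or a network request, here is a minimal sketch that applies it to an HTML string with Beautiful Soup directly. The helper name html_to_text and the sample markup are made up for illustration, and removing script and style elements first is an extra step I am assuming is wanted, not something the answer above requires.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop <script> and <style> elements so their contents do not end up in the text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator=" " keeps text from adjacent tags apart; strip=True trims each piece.
    return soup.get_text(separator=" ", strip=True)


# Made-up sample document, used only to exercise the helper.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

In the view from the question, the same separator and strip arguments should be usable as browser.parsed.get_text(separator=" ", strip=True), assuming browser.parsed is the Beautiful Soup object described above.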

OTHER TIPS

I prefer using bleach.

Here's some example code:

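import bleach

# bleach.clean() expects a string, so the find_all() result is converted with str() first;
# the trailing .strip() removes the list brackets left over from that conversion.
varName = bleach.clean(
    str(result.find_all(class_='className')),
    strip=True,
).strip('[])')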
Licensed under: CC-BY-SA with attribution