BeautifulSoup4 parse everything except specific tags

https://stackoverflow.com/questions/23305256

09-07-2023

Question

I am using Python's BeautifulSoup to parse some HTML. I want to extract only the text of the document, except that <ul> and <li> tags should be kept in place: sort of the opposite of unwrap(). So I want a function parse_everything_but_lists with the following behaviour:

>>> parse_everything_but_lists("Hello <a>this</a> is <ul><li>me</li><li><b>Dr</b> Pablov</li></ul>")
"Hello this is <ul><li>me</li><li>Dr Pablov</li></ul>"

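Solution

You can still use unwrap(); you just need to get a bit recursive. Process each tag's children first, then unwrap the tag itself unless its name is one you want to keep.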

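from bs4 import BeautifulSoup, Tag

def unwrapper(tags, keep=('ul', 'li')):
    # iterate over a copy: unwrap() modifies the child list during the loop
    for el in list(tags.children):
        if isinstance(el, Tag):
            unwrapper(el, keep)  # recurse first (passing keep along), unwrap later
            if el.name not in keep:
                el.unwrap()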

Demo:

s = '''"Hello <a>this</a> is <ul><li>me</li><li><b>Dr</b> Pablov</li></ul>"'''

soup = BeautifulSoup(s, 'html.parser') # force html.parser to avoid lxml's auto-inclusion of <html><body>

unwrapper(soup)

soup
Out[63]: "Hello this is <ul><li>me</li><li>Dr Pablov</li></ul>"
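
The question asks for a string-in, string-out function, so the recursive idea can be wrapped up roughly as follows. This is only a sketch: the helper name `_unwrap_except` is made up for illustration, and it assumes `html.parser` so that no `<html>`/`<body>` wrapper is added.

```python
from bs4 import BeautifulSoup, Tag


def _unwrap_except(node, keep):
    """Unwrap every descendant tag of `node` whose name is not in `keep`."""
    # Snapshot the children, because unwrap() edits node.contents in place.
    for child in list(node.children):
        if isinstance(child, Tag):
            _unwrap_except(child, keep)  # handle descendants first
            if child.name not in keep:
                child.unwrap()


def parse_everything_but_lists(html, keep=("ul", "li")):
    """Return `html` with all tags stripped except those named in `keep`."""
    soup = BeautifulSoup(html, "html.parser")
    _unwrap_except(soup, keep)
    return str(soup)


print(parse_everything_but_lists(
    "Hello <a>this</a> is <ul><li>me</li><li><b>Dr</b> Pablov</li></ul>"
))
# Hello this is <ul><li>me</li><li>Dr Pablov</li></ul>
```

Passing a different `keep` tuple (for example `("ol", "li")`) keeps other tags instead.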

This approach should work on arbitrarily nested tags, for example:

s = '''"<a><b><ul><c><li><d>Hello</d></li></c></ul></b></a>"'''

soup = BeautifulSoup(s, 'html.parser')
unwrapper(soup)

soup
Out[19]: "<ul><li>Hello</li></ul>"
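
If you would rather avoid explicit recursion, one possible alternative is to collect every tag up front with `find_all(True)` and unwrap the unwanted ones in a flat loop. This is a sketch of a different technique, not the method used above, and the function name `unwrap_all_except` is hypothetical.

```python
from bs4 import BeautifulSoup


def unwrap_all_except(soup, keep=("ul", "li")):
    """Unwrap, in place, every tag in `soup` whose name is not in `keep`."""
    # find_all(True) matches every tag and returns a list-like snapshot,
    # so unwrapping inside the loop does not disturb the iteration.
    for tag in soup.find_all(True):
        if tag.name not in keep:
            tag.unwrap()
    return soup


soup = BeautifulSoup(
    "<a><b><ul><c><li><d>Hello</d></li></c></ul></b></a>", "html.parser"
)
print(unwrap_all_except(soup))
# <ul><li>Hello</li></ul>
```

Because an unwrapped tag hands its children to its parent, tags further down the list still have a valid parent when their turn comes.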

Licensed under: CC-BY-SA with attribution
