Question

I want to define a whitelist of allowed tags in a soup and remove every tag that is not on it. Something like this, except working:

HTML:

<title>title</title>
<p>p</p>
<span>span</span>
<script>script</script>

Python:

>>> p = soup.find_all('p')
>>> span =  soup.find_all('span')
>>> title = soup.find_all('title')
>>> whitelist = p + span + title

>>> [el.extract() for el in soup.find_all() if el not in whitelist]

This just returns a blank soup. How can I make this work?


Solution

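All you need to do is pass a callable to find_all. It is called once per tag, and find_all returns the tags for which it returns True, so a membership test against the set of allowed names selects the tags to keep.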

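from bs4 import BeautifulSoup

s = '''<title>title</title>
<p>p</p>
<span>span</span>
<script>script</script>'''

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(s, 'html.parser')

# Names of the tags to keep
keepset = {'title', 'p', 'span'}

# Returns every tag whose name is in keepset
soup.find_all(lambda tag: tag.name in keepset)
Out[59]: [<title>title</title>, <p>p</p>, <span>span</span>]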
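
Note that find_all only returns the matching tags; the soup itself is left unchanged. To actually strip the other tags, one option is to invert the test and remove each match. The following is a minimal sketch, not the answerer's code: it assumes the html.parser backend and assumes that dropping a disallowed tag together with everything inside it is acceptable. The name keepset follows the answer above; everything else is illustrative.

```python
from bs4 import BeautifulSoup

html = '''<title>title</title>
<p>p</p>
<span>span</span>
<script>script</script>'''

# Assumption: html.parser is used, so the fragment is parsed as given
soup = BeautifulSoup(html, 'html.parser')

keepset = {'title', 'p', 'span'}

# find_all returns a list, so removing tags while looping over it is safe.
# decompose() deletes the tag and all of its contents from the tree.
for tag in soup.find_all(lambda tag: tag.name not in keepset):
    tag.decompose()

print([tag.name for tag in soup.find_all()])  # ['title', 'p', 'span']
```

If a disallowed tag can contain allowed tags that should survive, tag.unwrap(), which removes the tag but leaves its contents in place, may be a better fit than decompose().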
Licensed under: CC-BY-SA with attribution