Question

The page that I'm scraping contains the following HTML. How do I remove the comment <!-- --> along with its content using bs4?

<div class="foo">
cat dog sheep goat
<!-- 
<p>NewPP limit report
Preprocessor node count: 478/300000
Post‐expand include size: 4852/2097152 bytes
Template argument size: 870/2097152 bytes
Expensive parser function count: 2/100
ExtLoops count: 6/100
</p>
-->
</div>

Solution

You can use extract() (solution is based on this answer):

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.

from bs4 import BeautifulSoup, Comment

data = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

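soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly

div = soup.find('div', class_='foo')
# Calling a tag is shorthand for find_all(); string= replaces the older text= argument.
for element in div(string=lambda text: isinstance(text, Comment)):
    element.extract()

print(soup.prettify())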

As a result you get your div without comments:

<div class="foo">
    cat dog sheep goat
</div>
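The same idea can be applied to a whole document rather than a single div. The following is a minimal sketch, assuming bs4 4.4.0 or later (for the string= argument) and the built-in html.parser; strip_comments is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Return html with every <!-- ... --> comment removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so extracting nodes while looping over it is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return str(soup)


html = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

cleaned = strip_comments(html)
print(cleaned)
```

Because the helper returns a string, the result can be written to a file or parsed again if further processing is needed.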

OTHER TIPS

Usually modifying the bs4 parse tree is unnecessary. You can just get the div's text, if that's what you wanted:

soup.body.div.text
Out[18]: '\ncat dog sheep goat\n\n'
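As a self-contained illustration of reading the text without touching the tree, here is a small sketch that uses get_text(strip=True) to drop the surrounding whitespace as well. It assumes the built-in html.parser and looks the div up by class, since html.parser does not add a body element around a fragment.

```python
from bs4 import BeautifulSoup

html = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="foo")

# get_text() skips Comment nodes by default; strip=True trims each text piece
# and drops the ones that are only whitespace.
text = div.get_text(strip=True)
print(text)  # cat dog sheep goat
```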

bs4 keeps the comment as a separate node, so it is not part of the text. However, if you really need to modify the parse tree:

from bs4 import Comment

# Iterate over a copy: extracting nodes while walking .children can skip siblings.
for child in list(soup.body.div.children):
    if isinstance(child, Comment):
        child.extract()

From this answer: if you are looking for a solution for BeautifulSoup version 3 (see the BS3 docs on Comment):

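import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2 only)

soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
# Find the comment by a word it contains, then take its class (the Comment type).
comment = soup.find(text=re.compile("nice"))
Comment = comment.__class__
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
print soup.prettify()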

A little late, but I have compared the main answers found online so you can choose what works best for you.

Comments can also be removed with a regex:

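import re
soupstr = str(soup)
# re.DOTALL lets '.' span newlines, so multi-line comments are matched too
result = re.sub(r'<!--.*?-->', '', soupstr, flags=re.DOTALL)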
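A regex for this has to cope with comments that span several lines, like the one in the question. The sketch below uses only the standard library to show the difference the re.DOTALL flag makes; COMMENT_RE is an illustrative name. Note that a regex does not understand HTML, so it would also remove comment-like text inside script or style blocks.

```python
import re

# Non-greedy match from "<!--" to the first "-->"; DOTALL lets "." cross newlines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

html = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

without_flag = re.sub(r"<!--.*?-->", "", html)  # multi-line comment survives
with_flag = COMMENT_RE.sub("", html)            # multi-line comment is removed

print("<!--" in without_flag)  # True
print("<!--" in with_flag)     # False
```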

However, the regex approach is about 4 times slower than the find_all ... isinstance(x, Comment) approach shown in the other answers when the soup first has to be converted to a string with soupstr = str(soup).

It is about 5 times faster, though, when the HTML is already a string and the regex is applied before parsing.

Benchmark results after running each function 1000 times:

bs4, isinstance(x, Comment) method:      0.01193189620971680 ms
convert soup to string, then regex:      0.04188799858093262 ms
regex before converting to soup:         0.00195980072021484 ms (WINNER!)
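The code used for these measurements is not shown, so the following is only a sketch of how such a comparison could be reproduced with timeit. The function names, the sample HTML, and the run count are all assumptions; timings will differ by machine, parser, and document size.

```python
import re
import timeit

from bs4 import BeautifulSoup, Comment

HTML = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def fresh_soup():
    return BeautifulSoup(HTML, "html.parser")


def via_isinstance(soup):
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return soup


def via_str_then_regex(soup):
    return COMMENT_RE.sub("", str(soup))


def via_regex_on_string(html):
    return COMMENT_RE.sub("", html)


def mean_seconds(func, make_arg, runs=200):
    """Average time of func over `runs` calls, each on a freshly built argument."""
    holder = {}
    timer = timeit.Timer(
        stmt=lambda: func(holder["arg"]),
        # setup runs before every timed call and is excluded from the timing,
        # so parsing cost is not counted and each call sees unmodified input.
        setup=lambda: holder.update(arg=make_arg()),
    )
    times = timer.repeat(repeat=runs, number=1)
    return sum(times) / len(times)


results = {
    "bs4 isinstance": mean_seconds(via_isinstance, fresh_soup),
    "str(soup) + regex": mean_seconds(via_str_then_regex, fresh_soup),
    "regex on raw string": mean_seconds(via_regex_on_string, lambda: HTML),
}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1000:.6f} ms")
```

Rebuilding the soup before each timed call matters because extract() mutates the tree: without it, only the first run would find any comments to remove.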

You can use pure regex in cases where you don't want to use the isinstance method.

For people who need a quick result and don't want to read the full answer, here is a copy-paste function, ready to run:

import re

def remove_comments_regexmethod(soup):
    # soup can be a string or a bs4.BeautifulSoup instance; a soup is converted
    # to a string first, so pass a string if you want the highest speed.
    if not isinstance(soup, str):
        soup = str(soup)
    # re.DOTALL lets '.' match newlines, so multi-line comments are removed too.
    return re.sub(r'<!--.*?-->', '', soup, flags=re.DOTALL)  # returns a string
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow