Error with Beautiful Soup's extract()
21-08-2019
Question
I'm working on some screen-scraping software and have run into an issue with Beautiful Soup. I'm using Python 2.4.3 and Beautiful Soup 3.0.7a.
I need to remove an <hr> tag, but it can have many different attributes, so a simple replace() call won't cut it.
Given the following HTML:
<h1>foo</h1>
<h2><hr/>bar</h2>
And the following code:
soup = BeautifulSoup(string)
# Remove every <hr> tag, whatever attributes it carries
for tag in soup.findAll('hr'):
    tag.extract()
for i in soup.findAll(['h1', 'h2']):
    print i
    print i.string
The output is:
<h1>foo</h1>
foo
<h2>bar</h2>
None
Am I misunderstanding the extract function, or is this a bug with Beautiful Soup?
Solution
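It may be a bug. In Beautiful Soup 3.0.x, .string appears to be assigned while the document is parsed and not recomputed after extract(), so the <h2>, which had two children at parse time, keeps reporting None. Fortunately, there is another way to get the string: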
from BeautifulSoup import BeautifulSoup

string = """<h1>foo</h1>
<h2><hr/>bar</h2>"""

soup = BeautifulSoup(string)
for tag in soup.findAll('hr'):
    tag.extract()
for i in soup.findAll(['h1', 'h2']):
    # .next is the element that follows the tag, here its first text node
    print i, i.next
# <h1>foo</h1> foo
# <h2>bar</h2> bar
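If editing the markup before it is parsed is acceptable, another option is to strip the <hr> tags with a regular expression that tolerates arbitrary attributes, which is where a plain replace() falls short. The following is only a minimal sketch: strip_hr is a hypothetical helper name, and the pattern assumes that no attribute value contains a literal ">".

```python
import re

# Matches <hr>, <hr/>, <HR class="x" /> and similar. The \b keeps the
# pattern from matching other tag names that merely start with "hr".
# Assumption: no attribute value contains a literal ">".
HR_TAG = re.compile(r'<hr\b[^>]*>', re.IGNORECASE)


def strip_hr(html):
    """Return html with every <hr ...> tag removed (hypothetical helper)."""
    return HR_TAG.sub('', html)


html = '<h1>foo</h1>\n<h2><hr class="rule" />bar</h2>'
cleaned = strip_hr(html)
print(cleaned)
```

Parsing the cleaned markup should then give an <h2> whose only child is its text, so .string ought to behave as it does for the <h1>. Regular expressions are fragile on real-world HTML, so treat this as a stopgap rather than a general solution.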
OTHER TIPS
I ran into the same problem. I do not know why, but I guess it has to do with the empty elements created by Beautiful Soup.
For example, given the following code:
from bs4 import BeautifulSoup
html =' \
<a> \
<b test="help"> \
hello there! \
<d> \
now what? \
</d> \
<e> \
<f> \
</f> \
</e> \
</b> \
<c> \
</c> \
</a> \
'
soup = BeautifulSoup(html,'lxml')
#print(soup.find('b').attrs)
print(soup.find('b').contents)
t = soup.find('b').findAll()
#t.reverse()
for c in t:
    gb = c.extract()
print(soup.find('b').contents)
soup.find('b').text.strip()
I got the following error:
'NoneType' object has no attribute 'next_element'
The first print gave:
>>> print(soup.find('b').contents)
[u' ', <d> </d>, u' ', <e> <f> </f> </e>, u' ']
and the second gave:
>>> print(soup.find('b').contents)
[u' ', u' ', u' ']
I'm pretty sure it is the empty element in the middle that is causing the problem.
A workaround I found is to just recreate the soup:
soup = BeautifulSoup(str(soup), 'lxml')
soup.find('b').text.strip()
Now it prints:
>>> soup.find('b').text.strip()
u'hello there!'
I hope that helps.
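If the goal is only the text that sits directly inside <b>, one way to sidestep extract() altogether is to read the text nodes from .contents and skip the child tags. This is a minimal sketch, assuming Python 3 and that text nodes are str subclasses (as bs4's NavigableString is). own_text is a hypothetical helper, and the SimpleNamespace objects merely stand in for parsed tags so that the snippet runs without Beautiful Soup installed.

```python
from types import SimpleNamespace


def own_text(tag):
    """Join the text nodes that are direct children of tag.

    Child tags are skipped rather than extracted, so the tree is left
    untouched. Hypothetical helper; it relies only on tag.contents.
    """
    return ''.join(node for node in tag.contents if isinstance(node, str)).strip()


# Stand-ins for parsed tags: <b> holds text, a nested <d>, then more whitespace.
fake_d = SimpleNamespace(contents=[' now what? '])
fake_b = SimpleNamespace(contents=[' hello there! ', fake_d, ' '])

print(own_text(fake_b))  # hello there!
```

With a real soup the call would be own_text(soup.find('b')). Note that comment nodes in bs4 are also str subclasses, so they would be included unless filtered out separately.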