Error with Beautiful Soup's extract()

https://stackoverflow.com/questions/855087

21-08-2019

Question

I'm working on some screen-scraping software and have run into an issue with Beautiful Soup. I'm using Python 2.4.3 and Beautiful Soup 3.0.7a.

I need to remove an <hr> tag, but it can have many different attributes, so a simple replace() call won't cut it.

Given the following HTML:

<h1>foo</h1>
<h2><hr/>bar</h2>

And the following code:

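from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3

string = """<h1>foo</h1>
<h2><hr/>bar</h2>"""

soup = BeautifulSoup(string)

# Detach every <hr> tag from the tree, whatever attributes it carries.
for tag in soup.findAll('hr'):
    tag.extract()

for i in soup.findAll(['h1', 'h2']):
    print i
    print i.string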

The output is:

<h1>foo</h1>
foo
<h2>bar</h2>
None

Am I misunderstanding the extract function, or is this a bug with Beautiful Soup?

Solution

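It may be a bug: in this version, .string does not seem to be updated after extract() removes the <hr>, so it still reports None for the <h2>. Fortunately, there is another way to get the string. The .next attribute returns the next node in the parse tree, which is the text itself once the <hr> is gone: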

from BeautifulSoup import BeautifulSoup

string = \
"""<h1>foo</h1>
<h2><hr/>bar</h2>"""

soup = BeautifulSoup(string)

# Detach every <hr> tag from the tree.
for tag in soup.findAll('hr'):
    tag.extract()

for i in soup.findAll(['h1', 'h2']):
    print i, i.next

# <h1>foo</h1> foo
# <h2>bar</h2> bar
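
For readers on Beautiful Soup 4, the same cleanup can be sketched as below. This is only a minimal sketch, assuming the bs4 package and the standard-library html.parser backend, neither of which the question uses. The helper name strip_tags is made up for illustration. It removes the <hr> tags with decompose() and reads the text with get_text(), which does not depend on how many children a tag has.

```python
from bs4 import BeautifulSoup

HTML = """<h1>foo</h1>
<h2><hr/>bar</h2>"""


def strip_tags(markup, names):
    """Parse markup, remove every tag whose name is in names, return the soup.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    for tag in soup.find_all(names):
        tag.decompose()  # remove the tag and its contents from the tree
    return soup


soup = strip_tags(HTML, ["hr"])

# Pair each heading's markup with its text content.
headings = [(str(h), h.get_text()) for h in soup.find_all(["h1", "h2"])]
for markup, text in headings:
    print(markup, text)
```

decompose() discards the removed tag, whereas extract() returns it; either should do here, since the <hr> is not needed afterwards.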

OTHER TIPS

I've got the same problem. I do not know why, but I guess it has to do with the empty elements created by Beautiful Soup.

For example, given the following code:

from bs4 import BeautifulSoup

html ='            \
<a>                \
    <b test="help">            \
        hello there!  \
        <d>        \
        now what?  \
        </d>    \
        <e>        \
            <f>        \
            </f>    \
        </e>    \
    </b>        \
    <c>            \
    </c>        \
</a>            \
'

soup = BeautifulSoup(html,'lxml')
#print(soup.find('b').attrs)

print(soup.find('b').contents)

t = soup.find('b').findAll()
#t.reverse()
for c in t:
    gb = c.extract()

print(soup.find('b').contents)

soup.find('b').text.strip()  # this call raises the error shown below

I got the following error:

'NoneType' object has no attribute 'next_element'

The first print gave:

>>> print(soup.find('b').contents)
[u' ', <d> </d>, u' ', <e> <f> </f> </e>, u' ']

and the second gave:

>>> print(soup.find('b').contents)
[u' ', u' ', u' ']

I'm fairly sure the empty element in the middle is causing the problem.

A workaround I found is to re-create the soup from its serialized form:

soup = BeautifulSoup(str(soup), 'lxml')
soup.find('b').text.strip()

Now it prints:

>>> soup.find('b').text.strip()
u'hello there!'

I hope that helps.
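
Another option that might sidestep the problem is to extract only the direct children of <b>, so that a nested tag such as <f> is never extracted after its parent <e> has already been detached. The following is a minimal sketch, assuming bs4 with the standard-library html.parser backend and a compacted version of the markup above. Whether the double extraction is the actual cause of the error is an assumption, and this may or may not avoid it on the version that produced it.

```python
from bs4 import BeautifulSoup

# Compacted version of the markup from the example above.
html = (
    '<a><b test="help">hello there!'
    '<d>now what?</d><e><f></f></e></b><c></c></a>'
)

soup = BeautifulSoup(html, "html.parser")
b = soup.find("b")

# recursive=False returns only the direct child tags <d> and <e>,
# not the nested <f>, so each tag is detached from the tree exactly once.
removed = [child.extract() for child in b.find_all(recursive=False)]

remaining_text = b.get_text(strip=True)
print([tag.name for tag in removed])
print(remaining_text)
```

The nested <f> leaves the tree together with its parent <e>, so nothing inside <b> needs a second pass.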

Licensed under: CC-BY-SA with attribution
