Extracting tag content based on content value using BeautifulSoup

https://stackoverflow.com/questions/8909690

17-04-2021

Question

I have an HTML document in the following format.

<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>

I want to extract the content of the paragraph tag, including the content of the italic and bold tags but not the content of the anchor tag. If possible, I would also like to ignore the number at the beginning.

The expected output is: Content of the paragraph in italic but not strong.

What is the best way to do it?

Also, the following code snippet raises TypeError: argument of type 'NoneType' is not iterable:

soup = BSoup(page)
for p in soup.findAll('p'):
    if '&nbsp;&nbsp;&nbsp;' in p.string:
        print p

Thanks for the suggestions.

Solution

Your code fails because tag.string is set only if the tag has exactly one child and that child is a NavigableString. Your p tag has several children, so p.string is None.

You can achieve what you want by extracting the a tag:

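# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup

s = """<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>"""
# convertEntities turns &nbsp; into real non-breaking space characters
soup = BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)

for p in soup.findAll('p'):
    # Remove every a tag (and its text) from the paragraph
    for a in p.findAll('a'):
        a.extract()
    # Join the remaining text nodes, including those inside i and b
    print ''.join(p.findAll(text=True))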

OTHER TIPS

The problem you're having with string is that, as the documentation explains, it is only available:

if a tag has only one child node, and that child node is a string

Hence, in your case p.string is None and you can't iterate over it. To access a tag's contents, use p.contents (a list that includes the child tags) or p.text (a string with all the tags removed).
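To make the difference concrete, here is a minimal sketch. It assumes Beautiful Soup 4 (the bs4 package) on Python 3 rather than the BeautifulSoup 3 used elsewhere in this thread, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

html = ('<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
        'but not <b> strong </b> <a href="url">ignore</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# The p tag has several children (strings and tags), so .string is None
string_is_none = p.string is None

# The i tag has exactly one child, a string, so .string works there
italic_string = p.i.string

# get_text() always returns a string, so a membership test is safe;
# the parser has decoded &nbsp; into the character \xa0
has_nbsp = "\xa0" in p.get_text()

print(string_is_none, repr(italic_string), has_nbsp)
```

Testing against p.get_text() (or guarding with `p.string or ""`) avoids the TypeError from the question.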

In your case, you're probably looking for something like this:

>>> ''.join([str(e) for e in soup.p.contents
...          if not isinstance(e, BeautifulSoup.Tag)
...          or e.name != 'a'])
'&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> .'

If you also need to remove the '&nbsp;&nbsp;&nbsp;' prefix, I'd use a regular expression to strip it from the final string.
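A rough sketch of that regular-expression step, using only the standard library. The pattern and the strip_prefix helper are my own assumptions about what the prefix looks like (literal &nbsp; entities or decoded non-breaking spaces, then an optional "1." style number), so adjust them to your data.

```python
import re

# Sample input: the joined paragraph contents with entities left unconverted
text = ('&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
        'but not <b> strong </b> .')

# Hypothetical pattern: any mix of "&nbsp;", \xa0 and ordinary whitespace,
# followed by an optional number with a trailing dot
PREFIX = re.compile(r'^(?:&nbsp;|\xa0|\s)*(?:\d+\.\s*)?')

def strip_prefix(s):
    """Remove the leading non-breaking spaces and list number, if present."""
    return PREFIX.sub('', s, count=1)

cleaned = strip_prefix(text)
print(cleaned)
```

Anchoring the pattern at the start of the string keeps it from touching digits that appear later in the paragraph.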

I think you would just have to iterate through the tags inside p and collect the desired strings.

Using lxml, you could use XPath:

import lxml.html as LH
import re

content = '''\
<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>'''

doc = LH.fromstring(content)
ptext = ''.join(doc.xpath('//p/descendant-or-self::*[not(self::a)]/text()'))
pat = r'^.*\d+.\s*'
print(re.sub(pat,'',ptext))

yields:

Content of the paragraph  in italic  but not  strong  .

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text

From the documentation linked above: if you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.
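Applied to the paragraph from the question, that could look something like the sketch below. It assumes Beautiful Soup 4 on Python 3, and it removes the a tags before calling get_text() because get_text() on its own would include the link text as well.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

html = ('<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
        'but not <b> strong </b> <a href="url">ignore</a>.</p>')
soup = BeautifulSoup(html, "html.parser")

results = []
for p in soup.find_all("p"):
    # Drop every a tag, together with its text, from the paragraph
    for a in p.find_all("a"):
        a.decompose()
    # get_text() concatenates the remaining strings; split() and join()
    # collapse runs of whitespace, including the decoded &nbsp; characters
    results.append(" ".join(p.get_text().split()))

print(results)
```

This still leaves the leading "1." and a space before the final period, so a small cleanup step is needed if you want exactly the expected output.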

Licensed under: CC-BY-SA with attribution
