Find All text within 1 level in HTML using Beautiful Soup - Python

https://stackoverflow.com/questions/17030605

31-05-2022

Question

I need to use Beautiful Soup to accomplish the following.

Example HTML

<div id = "div1">
 Text1
 <div id="div2">
   Text2
   <div id="div3">
    Text3
   </div>
 </div>
</div>

I need to search this and get back a list in which each div's own text is a separate item:

Text1
Text2
Text3

I tried findAll('div'), but it repeated the same text multiple times, i.e. it returned:

Text1 Text2 Text3
Text2 Text3
Text3

Solution

Well, your problem is that .text also includes the text from all the child nodes. You'll have to manually get only those text nodes that are immediate children of a node. Also, there might be multiple text nodes directly inside a given node, for example:

<div>
    Hello
        <div>
            foobar
        </div>
    world!
</div>

How do you want them to be concatenated? Here is a function that joins them with a space:

def extract_text(node):
    return ' '.join(t.strip() for t in node(text=True, recursive=False))

With my example:

In [27]: t = """
<div>
    Hello
        <div>
            foobar
        </div>
    world!
</div>"""

In [28]: soup = BeautifulSoup(t)

In [29]: map(extract_text, soup('div'))
Out[29]: [u'Hello world!', u'foobar']
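The session above is from Python 2 (note the u'' prefixes and map returning a list). As a minimal sketch of the same idea on Python 3, assuming Beautiful Soup 4.4.0 or later (where the text argument is also spelled string) and the standard-library "html.parser" backend, the following contrasts the direct-children approach with get_text(), which is what causes the repetition. The variable names direct and nested are mine, for illustration only.

```python
from bs4 import BeautifulSoup

html = """
<div>
    Hello
        <div>
            foobar
        </div>
    world!
</div>"""


def extract_text(node):
    # recursive=False restricts the search to direct children of `node`,
    # and string=True keeps only text nodes (older bs4 versions: text=True).
    return ' '.join(t.strip() for t in node.find_all(string=True, recursive=False))


soup = BeautifulSoup(html, "html.parser")

# Text owned by each div itself.
direct = [extract_text(div) for div in soup.find_all("div")]

# get_text() walks all descendants, so the inner text shows up twice.
nested = [" ".join(div.get_text().split()) for div in soup.find_all("div")]

print(direct)  # ['Hello world!', 'foobar']
print(nested)  # ['Hello foobar world!', 'foobar']
```

A list comprehension is used instead of map because map returns a lazy iterator on Python 3.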

And your example:

In [32]: t = """
<div id = "div1">
 Text1
 <div id="div2">
   Text2
   <div id="div3">
    Text3
   </div>
 </div>
</div>"""

In [33]: soup = BeautifulSoup(t)

In [34]: map(extract_text, soup('div'))
Out[34]: [u'Text1 ', u'Text2 ', u'Text3']
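The trailing spaces in 'Text1 ' and 'Text2 ' come from the whitespace-only text node that follows each nested div: it strips to an empty string, which is still joined with a space. If that is unwanted, one option is to drop the empty fragments before joining. This is a sketch rather than the only way to do it; direct_text is a hypothetical helper name, and it assumes Python 3 with the "html.parser" backend.

```python
from bs4 import BeautifulSoup, NavigableString


def direct_text(node):
    # Keep only text nodes that are immediate children of `node`,
    # strip them, and discard the ones that were pure whitespace.
    parts = (
        child.strip()
        for child in node.children
        if isinstance(child, NavigableString)
    )
    return " ".join(p for p in parts if p)


html = """
<div id = "div1">
 Text1
 <div id="div2">
   Text2
   <div id="div3">
    Text3
   </div>
 </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
texts = [direct_text(div) for div in soup.find_all("div")]
print(texts)  # ['Text1', 'Text2', 'Text3']
```

Each div contributes exactly one list item, in document order, which matches the output asked for in the question.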

Licensed under: CC-BY-SA with attribution