Select specific child elements with BeautifulSoup

https://stackoverflow.com/questions/1571699

21-09-2019
Question

I'm reading up on BeautifulSoup to screen-scrape some pretty heavy HTML pages. Going through the BeautifulSoup documentation, I can't seem to find an easy way to select child elements.

Given the HTML:

<div id="top">
  <div>Content</div>
  <div>
    <div>Content I Want</div>
  </div>
</div>

I want an easy way to get the "Content I Want" given that I have the object top. Coming to BeautifulSoup, I thought it would be easy, something like topobj.nodes[1].nodes[0].string. Instead, I only see variables and functions that return the elements together with text nodes, comments and so on.

Am I missing something? Or do I really need to resort to a long form using .find(), or even worse, list comprehensions on the .contents variable?

The reason is that I don't trust the whitespace of the webpage to stay the same, so I want to ignore it and traverse elements only.

Solution

find() gives you more flexibility. To get what you want directly, match on the text, where p is the parsed document built in the snippet further down:

node = p.find('div', text="Content I Want")
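
As a self-contained sketch of the text-based lookup, assuming BeautifulSoup 4 (the bs4 package) and Python 3. In bs4 the argument is spelled string=, and the names html, soup and found_text are only illustrative:

```python
from bs4 import BeautifulSoup

# Same markup as in the question, including its whitespace.
html = """<div id="top">
  <div>Content</div>
  <div>
    <div>Content I Want</div>
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Match the <div> whose single child is exactly this string.
# The outer divs have several children, so their .string is None
# and they are not matched.
node = soup.find("div", string="Content I Want")

# find() returns None when nothing matches, so guard before using it.
found_text = node.string if node is not None else None
print(found_text)
```

The match is exact, so this only helps when the text is known in advance and carries no surrounding whitespace inside the element.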

If you would rather navigate by position than by text, the following options may suit you better:

from bs4 import BeautifulSoup  # BeautifulSoup 4, Python 3

xml = """<div id="top"><div>Content</div><div><div>Content I Want</div></div></div>"""
p = BeautifulSoup(xml, "html.parser")

# the inner div's child nodes, as a list (here a single text node)
print(p.div.div.find_next_sibling().div.contents)
# only the text nodes inside the inner div, as a list
print(p.div.div.find_next_sibling().div(string=True))
# join (and strip) the text nodes into one string
print(''.join(s.strip() for s in p.div.div.find_next_sibling().div(string=True)))
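
The question asks for element-only traversal that ignores whitespace, along the lines of topobj.nodes[1].nodes[0].string. A minimal sketch of one way to get that, again assuming BeautifulSoup 4 and Python 3: find_all(True, recursive=False) returns only the direct child tags, skipping text nodes and comments. The helper child_elements is hypothetical, not part of the library:

```python
from bs4 import BeautifulSoup

# Markup from the question, with whitespace between the tags.
html = """<div id="top">
  <div>Content</div>
  <div>
    <div>Content I Want</div>
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
top = soup.find("div", id="top")


def child_elements(tag):
    """Hypothetical helper: direct child tags only, no text nodes or comments."""
    return tag.find_all(True, recursive=False)


# Second child element of top, then its first child element.
second = child_elements(top)[1]
wanted = child_elements(second)[0].string
print(wanted)

# The same path written as a CSS selector (needs the soupsieve package,
# which recent bs4 releases install as a dependency).
via_css = soup.select_one("#top > div:nth-of-type(2) > div").string
print(via_css)
```

Because both variants count tags rather than nodes, they keep working if the page's whitespace changes, although they still break if the element order does.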

Licensed under: CC-BY-SA with attribution
