Question

I have this DOM:

<h2>Main Section</h2>
<p>Bla bla bla<p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>


<h2>Main Section 2</h2>
<p>bla</p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>

I'd like to generate an iterator that returns 'Main Section', 'Bla bla bla', 'Subsection', etc. Is there a way to do this with BeautifulSoup?


Solution

Here's one way to do it. The idea is to iterate over the main sections (the h2 tags) and, for each h2, walk its following siblings until the next h2 tag is reached:

from bs4 import BeautifulSoup, Tag


data = """<h2>Main Section</h2>
<p>Bla bla bla<p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>


<h2>Main Section 2</h2>
<p>bla</p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>"""


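# The output shown below is consistent with the lxml parser, which closes the
# unterminated <p> automatically; other parsers may build a different tree.
soup = BeautifulSoup(data, 'lxml')
for main_section in soup.find_all('h2'):
    # Walk the nodes that follow this h2 at the same level of the tree.
    for sibling in main_section.next_siblings:
        # Skip bare strings (e.g. the newlines between tags).
        if not isinstance(sibling, Tag):
            continue
        # The next h2 starts a new main section.
        if sibling.name == 'h2':
            break
        print(sibling.text)
    print("-------")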

prints:

Bla bla bla


Subsection
Some more info
Subsection 2
Even more info!
-------
bla
Subsection
Some more info
Subsection 2
Even more info!
-------
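
Note that the loop above prints what follows each h2, not the h2 heading itself. If a flat iterator over every heading and paragraph is what's wanted, a generator along these lines may be closer to the question. This is only a sketch: iter_texts is a hypothetical helper name, the sample markup is shortened, and it assumes the first <p> is closed properly so that the result does not depend on the parser.

```python
from bs4 import BeautifulSoup

# Shortened sample markup; assumes every <p> is closed properly.
data = """<h2>Main Section</h2>
<p>Bla bla bla</p>
<h3>Subsection</h3>
<p>Some more info</p>
<h2>Main Section 2</h2>
<p>bla</p>"""


def iter_texts(soup, names=("h2", "h3", "p")):
    """Yield the stripped text of each matching tag, in document order."""
    # find_all accepts a list of tag names and returns matches in document order.
    for tag in soup.find_all(list(names)):
        text = tag.get_text(strip=True)
        if text:  # skip tags that contain only whitespace
            yield text


soup = BeautifulSoup(data, "html.parser")
texts = list(iter_texts(soup))
print(texts)
```

Because it is a generator, it can also be consumed lazily with a plain for loop instead of being collected into a list.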

Hope that helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow