There are many ways to solve your problem. I chose to iterate over the h2 elements in one loop and over each header's following siblings in a nested loop, breaking out of the inner loop when I encounter the next h2. I did not remove whitespace; you can do that with Python string methods such as rstrip and lstrip (or strip for both ends). You can get rid of the "DOB:" with str.replace.
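from bs4 import BeautifulSoup
from bs4 import NavigableString

s = """your HTML here"""
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(s, "html.parser")

headers = soup.find_all("h2")
for h in headers:
    print(h.text)
    # Walk the nodes that follow this header at the same level
    for sibling in h.next_siblings:
        if sibling.name == "h2":
            # The next section starts here
            break
        elif isinstance(sibling, NavigableString):
            # Bare text node, not wrapped in a tag
            print(sibling.string)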
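
The whitespace and "DOB:" cleanup mentioned above could look roughly like this. It is only a sketch: clean_text is a hypothetical helper name, and the sample strings are made up to stand in for the text nodes the loop prints, since I don't know exactly what your HTML contains.

```python
def clean_text(raw, label="DOB:"):
    """Remove a label such as "DOB:" and trim surrounding whitespace."""
    return raw.replace(label, "").strip()


# Made-up stand-ins for the text nodes found between two h2 elements
samples = ["\n  DOB: 01/01/1970  \n", "  Some City \n", "\n"]

cleaned = [clean_text(text) for text in samples]
# Drop entries that contained nothing but whitespace
cleaned = [text for text in cleaned if text]

print(cleaned)
```

In the loop above you would call clean_text(sibling.string) and print the result only if it is non-empty, which also gets rid of the blank lines produced by whitespace-only text nodes.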