Parsing an HTML page using BeautifulSoup/Python

Question 1

Here's exactly what you need.

The idea is to define a list of the keys/labels you are interested in, find all b elements, and check whether the text of each b element is in that list. If it is, print the text of the b element followed by its next sibling:

from bs4 import BeautifulSoup

data = """<span id= "here" style>
 <br>
 <b> Post Primary</b>
 <b>school<b>
 <br>
 <b>Roll number: </b>b>
 "60000"
 <br>
 <b>Principal</b>
 "Paul Ince"
 <br>
 <b>Enrolment:</b>
 "Boys; 123 Girls: 102   (2012/13)"
 <br>
 <b>Ethos:</b>
 "Catholic  &nbsp "
 <b>Catchment:</b>
 "North Inner CIty "
 <br>
 <b>Fees:</b>
 " No "
</span>"""

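# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")

keys = ['Enrolment', 'Ethos', 'Fees']

for element in soup('b'):
    # Drop the trailing colon before comparing against the wanted labels
    if element.text[:-1] in keys:
        print(element.text + element.next_sibling.strip())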

prints:

Enrolment:"Boys; 123 Girls: 102   (2012/13)"
Ethos:"Catholic  &nbsp "
Fees:" No "
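
If you would rather collect the values than print them, the same idea can be wrapped in a function. This is only a minimal sketch: the name extract_fields and the sample markup are made up for illustration, and it assumes each value is a plain text node directly after its b label. Labels followed by another tag are skipped instead of raising an error.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def extract_fields(html, wanted):
    """Map each wanted label to the text that directly follows its <b> tag.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    found = {}
    for b in soup.find_all("b"):
        # Normalise the label: trim whitespace and an optional trailing colon
        label = b.get_text().strip().rstrip(":")
        if label not in wanted:
            continue
        sibling = b.next_sibling
        # Only accept text nodes; a following tag (or nothing at all) is skipped
        if isinstance(sibling, NavigableString):
            found[label] = sibling.strip()
    return found


# Simplified, well-formed sample markup (an assumption, not the original page)
sample = (
    "<span><b>Ethos:</b> Catholic <br>"
    "<b>Fees:</b> No <br>"
    "<b>Principal</b> Paul Ince</span>"
)
result = extract_fields(sample, {"Ethos", "Fees"})
print(result)
```

Stripping the colon with rstrip(":") instead of slicing off the last character means a label such as "Principal", which has no colon, can be requested too.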

Hope that helps.

Question 2

Once the closing tags of the <b> elements are fixed, you can parse a document like this by noting that the text you are after always follows a bold tag.

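import bs4

# A is assumed to hold the HTML string with the <b> closing tags corrected
soup = bs4.BeautifulSoup(A, "html.parser")
data = {}

for item in soup.find_all("b"):
    # The value is the text node that directly follows each <b> label
    next_item = item.next_sibling
    data[item.text.strip()] = next_item.string.strip()

print(data)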

This gives a dictionary from which you can extract the elements you are looking for:

{u'Ethos:': u'"Catholic  &nbsp "', u'school': u'', u'Fees:': u'" No "', u'Post Primary': u'', u'Roll number:': u'"60000"', u'Catchment:': u'"North Inner CIty "', u'Enrolment:': u'"Boys; 123 Girls: 102   (2012/13)"', u'Principal': u'"Paul Ince"'}
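
The values in that dictionary still carry the surrounding quotes, the stray &nbsp text and the trailing colons on the labels. A small clean-up pass could look like the sketch below. The helper name clean_fields is made up, and the input is a hand-copied subset of the dictionary above, so treat it as an illustration, not a finished solution.

```python
def clean_fields(raw):
    """Return a tidied copy of raw: labels without colons, values without quotes.

    Hypothetical helper; entries whose value ends up empty are dropped.
    """
    cleaned = {}
    for label, value in raw.items():
        key = label.strip().rstrip(":")
        # Turn the leftover entity text and real non-breaking spaces into spaces
        text = value.replace("&nbsp", " ").replace("\xa0", " ")
        # Trim whitespace, then the literal double quotes, then whitespace again
        text = text.strip().strip('"').strip()
        if text:
            cleaned[key] = text
    return cleaned


# A subset of the dictionary shown above, copied by hand
raw = {
    "Ethos:": '"Catholic  &nbsp "',
    "school": "",
    "Fees:": '" No "',
    "Principal": '"Paul Ince"',
}
cleaned = clean_fields(raw)
print(cleaned)
```

Entries such as "school", which are fragments of a heading with no value after them, drop out because their value is empty.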

Question 3

Here's another option. Because the document contains malformed HTML, it seemed reasonable to me to ignore the markup and just use the text of the document (BeautifulSoup provides that too). You should determine whether the problems with the bold tags are yours or come from the original source.

from bs4 import BeautifulSoup

html = """
<span id= "here" style>
 <br>
  <b> Post Primary</b>
   <b>school<b>
    <br>
     <b>Roll number: </b>b>
    "60000"
<br>
<b>Principal</b>
        "Paul Ince"
        <br>
    <b>Enrolment:</b>
"Boys; 123 Girls: 102   (2012/13)"
<br>
        <b>Ethos:</b>
    "Catholic  &nbsp "
    <b>Catchment:</b>
        "North Inner CIty "
        <br>
        <b>Fees:</b>
            " No "
    </span>
"""

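soup = BeautifulSoup(html, "html.parser")
# Keep only the non-blank lines of the document text, stripped of padding
q = [item.strip() for item in soup.text.split('\n') if item.strip()]
d = {}
# Stop one short of the end so q[i+1] always exists
for i in range(len(q) - 1):
    if 'Enrolment' in q[i] or 'Ethos' in q[i] or 'Fees' in q[i]:
        d[q[i]] = q[i+1]

print(d)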
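
The line-pairing step does not depend on BeautifulSoup at all, so it can be pulled out into a plain function and tried on any text. The sketch below assumes, as the code above does, that each value sits on the line right after its label. The function name and the sample text are made up for illustration.

```python
def pair_labels_with_values(text, labels):
    """Pair every line containing one of labels with the next non-blank line.

    Hypothetical helper working on plain text such as soup.text.
    """
    # Drop blank lines and surrounding whitespace first
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    pairs = {}
    # zip stops at the shorter sequence, so a label on the last line is ignored
    for current, following in zip(lines, lines[1:]):
        if any(label in current for label in labels):
            pairs[current] = following
    return pairs


# Hand-written stand-in for the text extracted from the page
text = """
 Principal
 "Paul Ince"
 Enrolment:
 "Boys; 123 Girls: 102   (2012/13)"
 Fees:
 " No "
"""
pairs = pair_labels_with_values(text, ["Enrolment", "Fees"])
print(pairs)
```

Because the match is a substring test, a value line that happens to contain one of the labels would also be treated as a label, so this only suits pages where that cannot happen.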