Question

I am currently parsing an HTML page to extract some information.

Sometimes there is no text after a closing tag, as with Ethos in the HTML document below:

<span id= "here" style>
  <br>
  <b> Post Primary</b>
  <b>school<b>
  <br>
  <b>Roll number: </b>
  "60000"
  <br>
  <b>Principal</b>
      "Paul Ince"
  <br>
  <b>Enrolment:</b>
  "Boys; 193 Girls: 190   (2012/13)"
  <br>
  <b>Ethos:</b>
  <b>Catchment:</b>
  "North Inner CIty "
  <br>
 <b>Fees:</b>
 " No "
</span>

I would like to extract the following information

Enrolment= "Boys:193 Girls: 190 (2012/13)"

Ethos= ""

Fees="No"

Solution

Here's exactly what you need.

The idea is to define a list of the keys (labels) you are interested in, find all b elements, and check whether the text of each b element, minus its trailing colon, is in that list. If it is, print the text of the b element together with its next sibling:

from bs4 import BeautifulSoup

data = """<span id= "here" style>
 <br>
 <b> Post Primary</b>
 <b>school<b>
 <br>
 <b>Roll number: </b>b>
 "60000"
 <br>
 <b>Principal</b>
 "Paul Ince"
 <br>
 <b>Enrolment:</b>
 "Boys; 123 Girls: 102   (2012/13)"
 <br>
 <b>Ethos:</b>
 "Catholic  &nbsp "
 <b>Catchment:</b>
 "North Inner CIty "
 <br>
 <b>Fees:</b>
 " No "
</span>"""

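# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, "html.parser")

keys = ['Enrolment', 'Ethos', 'Fees']

for element in soup('b'):
    # Drop the trailing colon from the label before comparing it to the keys
    if element.text[:-1] in keys:
        # next_sibling is the text node that follows the closing </b>
        print(element.text + element.next_sibling.strip())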

prints:

Enrolment:"Boys; 123 Girls: 102   (2012/13)"
Ethos:"Catholic  &nbsp "
Fees:" No "

Hope that helps.
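
The loop above assumes that a text node always follows the b element. As a minimal sketch of how the empty Ethos case from the question could be handled, the version below treats anything other than a text node (another tag, or nothing at all) as an empty value and also strips the literal quotes. The function name extract_fields and the trimmed sample markup are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample trimmed from the question: "Ethos:" has no text after it
html = """<span id="here">
  <b>Enrolment:</b>
  "Boys; 193 Girls: 190   (2012/13)"
  <br>
  <b>Ethos:</b>
  <b>Catchment:</b>
  "North Inner CIty "
  <br>
  <b>Fees:</b>
  " No "
</span>"""


def extract_fields(markup, keys):
    """Return {label: value}, using "" when no text follows the label."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for b in soup.find_all("b"):
        label = b.get_text().strip().rstrip(":")
        if label not in keys:
            continue
        sibling = b.next_sibling
        # Only a text node counts as a value; a tag or None means "empty"
        if isinstance(sibling, NavigableString):
            value = sibling.strip().strip('"').strip()
        else:
            value = ""
        result[label] = value
    return result


fields = extract_fields(html, ["Enrolment", "Ethos", "Fees"])
print(fields)
```

Here the whitespace between the Ethos and Catchment tags is itself a text node, so it strips down to an empty string. The isinstance check covers the case where the two tags are directly adjacent.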

OTHER TIPS

Once the closing tags of the <b> elements are fixed, you can parse a document like this by noting that the text you are after always follows a bold tag.

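import bs4

# A is assumed to hold the HTML string from the question, with the <b> tags closed properly
soup = bs4.BeautifulSoup(A, "html.parser")
data = {}

for item in soup.find_all("b"):
    # The text node right after each <b> element holds its value
    next_item = item.next_sibling
    data[item.text.strip()] = next_item.string.strip()

print(data)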

This gives a dictionary from which you can extract the elements you are looking for (shown here as printed by Python 2):

{u'Ethos:': u'"Catholic  &nbsp "', u'school': u'', u'Fees:': u'" No "', u'Post Primary': u'', u'Roll number:': u'"60000"', u'Catchment:': u'"North Inner CIty "', u'Enrolment:': u'"Boys; 123 Girls: 102   (2012/13)"', u'Principal': u'"Paul Ince"'}
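
To get from that dictionary to the three values asked for, the keys still carry a trailing colon and the values still carry the literal quotes. A minimal sketch of that clean-up step follows. The helper name clean and the variable wanted are illustrative assumptions, and the data literal is a subset copied from the printed output above.

```python
# Subset of the dictionary printed above, copied by hand for illustration
data = {
    'Ethos:': '"Catholic  &nbsp "',
    'Fees:': '" No "',
    'Roll number:': '"60000"',
    'Enrolment:': '"Boys; 123 Girls: 102   (2012/13)"',
}


def clean(value):
    """Strip surrounding whitespace and the literal double quotes."""
    return value.strip().strip('"').strip()


wanted = ["Enrolment", "Ethos", "Fees"]
# A label that is missing from the dictionary falls back to an empty string
result = {key: clean(data.get(key + ":", "")) for key in wanted}
print(result)
```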

Here's another option. Because the document has HTML issues, it seemed reasonable to me to ignore the markup and work with the text of the document alone (BeautifulSoup provides that too). You should also determine whether the problems with the bold tags were introduced by you or come from the original source.

from bs4 import BeautifulSoup

html = """
<span id= "here" style>
 <br>
  <b> Post Primary</b>
   <b>school<b>
    <br>
     <b>Roll number: </b>b>
    "60000"
<br>
<b>Principal</b>
        "Paul Ince"
        <br>
    <b>Enrolment:</b>
"Boys; 123 Girls: 102   (2012/13)"
<br>
        <b>Ethos:</b>
    "Catholic  &nbsp "
    <b>Catchment:</b>
        "North Inner CIty "
        <br>
        <b>Fees:</b>
            " No "
    </span>
"""

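soup = BeautifulSoup(html, "html.parser")
# Split the document text into lines and drop the empty ones
lines = [item for item in soup.text.split('\n') if item != '']
d = {}
for i, line in enumerate(lines):
    # The value is assumed to sit on the line right after its label
    if any(key in line for key in ('Enrolment', 'Ethos', 'Fees')):
        d[line.strip()] = lines[i + 1].strip()

print(d)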
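
The text-based approach picks up whatever line follows the label, so an empty field such as Ethos in the question would receive the next label as its value. The sketch below shows one possible guard, under the assumption that a following line ending in a colon is itself a label. That heuristic would miss labels without a colon, such as Principal. The text literal is a hand-written stand-in for what soup.text might return, and parse_text is a hypothetical helper name.

```python
# Hand-written stand-in for the document text (what soup.text might return)
text = """
Enrolment:
"Boys; 193 Girls: 190   (2012/13)"

Ethos:
Catchment:
"North Inner CIty "

Fees:
" No "
"""

KEYS = ("Enrolment", "Ethos", "Fees")


def parse_text(text, keys=KEYS):
    # Keep only lines that contain something other than whitespace
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    result = {}
    for i, line in enumerate(lines):
        label = line.rstrip(":")
        if label not in keys:
            continue
        following = lines[i + 1] if i + 1 < len(lines) else ""
        # A following line that ends with ":" is treated as the next label
        if following.endswith(":"):
            following = ""
        result[label] = following.strip('"').strip()
    return result


d = parse_text(text)
print(d)
```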
Licensed under: CC-BY-SA with attribution