Python code sometimes raises a "'NoneType' object has no attribute" error and works perfectly at other times

https://stackoverflow.com/questions/19535520

01-07-2022

Question

def dcrawl(link):
    #importing the req. libraries & modules
    from bs4 import BeautifulSoup
    import urllib

    #fetching the document
    op = urllib.FancyURLopener({})
    f = op.open(link)
    h_doc = f.read()

    #trimming for the base document
    idoc1 = BeautifulSoup(h_doc)
    idoc2 = str(idoc1.find(id = "bwStory"))
    bdoc = BeautifulSoup(idoc2)

    #extract the date as a string
    dat = str(bdoc.div.div.string)[0:13]
    date = dst(dat)

    #extract the title as a string
    title = str(bdoc.b.string)
    #extract the full report as a string
    freport = str(bdoc.find_all("p"))

    #extract the place as a string
    plc = bdoc.find(id = "bwStoryBody")
    puni = plc.p.string

    #encoding to ascii to eliminate discrepancies
    pasi = puni.encode('ascii', 'ignore')
    com = pasi.find("-")
    place = pasi[:com]

The same conversion, "bdoc.b.string", works here:

#extract the full report as a string
freport = str(bdoc.find_all("p"))

In the line:

plc = bdoc.find(id = "bwStoryBody")

plc returns some data, and plc.p returns the first <p>...</p>, but converting it to a string doesn't work.

Because puni returned a string object earlier, I ran into Unicode errors, so I had to use encode to produce the pasi result.

Solution

.find() returns None when no matching element is found, and accessing an attribute such as .p or .string on None raises the AttributeError. Evidently some pages do not have the elements that you are looking for.

Test for it explicitly if you want to prevent attribute errors:

plc = bdoc.find(id = "bwStoryBody")
if plc is not None:
    puni = plc.p.string
    #encoding to ascii to eliminate discrepancies
    #By default python processes in unicode
    pasi = puni.encode('ascii', 'ignore')
    com = pasi.find("-")
    place = pasi[:com]
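
The same check can be applied at every step of the chain, since plc.p can also be None, and Beautiful Soup's .string is None whenever a tag has more than one child. The following is only a sketch of that idea, not the code from the answer: extract_place is a hypothetical helper name, it assumes bdoc is an already parsed BeautifulSoup object, and it swaps .string for get_text() so that nested markup inside the paragraph still yields text. It also uses partition() instead of find(), because find() returns -1 when there is no "-" and the slice pasi[:com] would then silently drop the last character.

```python
def extract_place(bdoc):
    """Return the text before the first "-" in the first <p> of the
    element with id "bwStoryBody", or None if any piece is missing.

    Hypothetical helper: bdoc is assumed to be a parsed BeautifulSoup object.
    """
    plc = bdoc.find(id="bwStoryBody")
    if plc is None:
        # This page has no element with id "bwStoryBody".
        return None

    para = plc.p
    if para is None:
        # The container exists but holds no <p> tag.
        return None

    # get_text() joins all nested text; .string would be None here
    # if the paragraph had more than one child.
    text = para.get_text()

    # partition() reports whether the separator was found at all.
    head, sep, _rest = text.partition("-")
    if not sep:
        # No "-" in the paragraph, so there is no place prefix to return.
        return None
    return head.strip()
```

A caller would then write place = extract_place(bdoc) and skip or log the page when the result is None, rather than letting the AttributeError propagate.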

Licensed under: CC-BY-SA with attribution
