Information lost when using BeautifulSoup to parse an HTML page

https://stackoverflow.com/questions/16410833

14-04-2022

Question

I'm writing a web spider to get some information from a website. When I parse this page, http://www.tripadvisor.com/Hotels-g294265-oa120-Singapore-Hotels.html#ACCOM_OVERVIEW , I find that some information is lost. I printed the HTML document using soup.prettify(), and it is not the same as the document I get from urllib2.urlopen(); something is missing. The code is as follows:

    htmlDoc = urllib2.urlopen(sourceUrl).read()
    soup = BeautifulSoup(htmlDoc)

    subHotelUrlTags = soup.findAll(name='a', attrs={'class' : 'property_title'})
    print len(subHotelUrlTags)
    #if len(subHotelUrlTags) != 30:
    #   print soup.prettify()
    for hotelUrlTag in subHotelUrlTags:
        hotelUrls.append(website + hotelUrlTag['href'])

I tried using HTMLParser to do the same thing, and it prints the following error:

    Traceback (most recent call last):
      File "./spider_new.py", line 47, in <module>
        hotelUrls = getHotelUrls()
      File "./spider_new.py", line 40, in getHotelUrls
        hotelParser.close()
      File "/usr/lib/python2.6/HTMLParser.py", line 112, in close
        self.goahead(1)
      File "/usr/lib/python2.6/HTMLParser.py", line 164, in goahead
        self.error("EOF in middle of construct")
      File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: EOF in middle of construct, at line 3286, column 1

Solution

Download and install lxml.

It can parse such "faulty" webpages. (The HTML is probably broken in some weird way, and Python's HTML parser isn't great at understanding that sort of thing, even with bs4's help.)
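
To check whether the parser, rather than the download, is dropping content, one option is to count the matching tags in the raw HTML text and compare that with the number of tags the parser returned. The following is a rough sketch only: the helper names `count_raw_matches` and `missing_after_parse` are hypothetical, and the regular expression is a crude approximation of tag matching, not a real HTML parser.

```python
import re


def count_raw_matches(html, class_name):
    """Count opening <a> tags in the raw HTML text that carry the given class."""
    pattern = re.compile(
        r'<a\b[^>]*\bclass\s*=\s*["\'][^"\']*\b'
        + re.escape(class_name)
        + r'\b[^"\']*["\']',
        re.IGNORECASE,
    )
    return len(pattern.findall(html))


def missing_after_parse(html, parsed_count, class_name="property_title"):
    """Return how many raw matches the parser did not surface (0 means none lost)."""
    return max(count_raw_matches(html, class_name) - parsed_count, 0)
```

With the question's variables, `missing_after_parse(htmlDoc, len(subHotelUrlTags))` returning a value above zero would suggest that the parser stopped or skipped part of the page.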

Also, you don't need to change your code if you install lxml: BeautifulSoup will automatically pick up lxml and use it to parse the HTML instead.
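
The parser choice can also be made explicit. The following is a minimal sketch, assuming Python 3 and the bs4 package (the code in the question targets Python 2). When no parser is named, bs4 prefers lxml, then html5lib, then the built-in html.parser. The names `PREFERRED_PARSERS`, `best_parser` and `make_soup` are hypothetical helpers that simply mirror that order.

```python
import importlib.util

# Parser names in the order bs4 prefers them when none is requested.
# Each entry is (name passed to BeautifulSoup, module that must be importable).
PREFERRED_PARSERS = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", "html.parser"),  # standard library, always present
]


def best_parser():
    """Return the name of the first preferred parser that is installed."""
    for name, module in PREFERRED_PARSERS:
        if importlib.util.find_spec(module) is not None:
            return name
    return "html.parser"


def make_soup(html):
    """Parse html with the best installed parser, named explicitly."""
    # bs4 is third-party; importing it here keeps best_parser() usable without it.
    from bs4 import BeautifulSoup
    return BeautifulSoup(html, best_parser())
```

Naming the parser explicitly makes it clear which one produced a given result when the same script runs on machines with different packages installed.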

Licensed under: CC-BY-SA with attribution
