Question

I'm trying to scrape links from Yahoo search results with the following Python code. I'm using mechanize for the browser instance and Beautiful Soup for parsing the HTML.

The problem is that this script sometimes works fine and sometimes throws the following error:

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

It's clearly something related to encoding/decoding, or to gzip compression I guess, but why does it work sometimes and not others? And how can it be fixed so it works every time?

Here is the code. Run it 7-8 times and you will notice.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import mechanize
from bs4 import BeautifulSoup

#mechanize emulates a Browser
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent','chrome')]

term = "stock market".replace(" ","+")
query = "https://search.yahoo.com/search?q=" + term

htmltext = br.open(query).read()
htm = str(htmltext)

soup = BeautifulSoup(htm)
#Since all results are located in the ol tag
search = soup.findAll('ol')

searchtext = str(search)

#Using BeautifulSoup to parse the HTML source
soup1 = BeautifulSoup(searchtext)
#Each search result is contained within div tag
list_items = soup1.findAll('div', attrs={'class':'res'})


#List of first search result
list_item = str(list_items)

for li in list_items:
    list_item = str(li)
    soup2 = BeautifulSoup(list_item)
    link = soup2.findAll('a')
    print link[0].get('href')
    print ""

Here's an output screenshot: http://pokit.org/get/img/1d47e0d0dc08342cce89bc32ae6b8e3c.jpg


Solution

I had issues with encoding on a project and developed a function to find the encoding of the page I was scraping; you can then decode the bytes to Unicode before passing them to your function, which should prevent these errors. Regarding compression: you need to structure your code so that, if it encounters a compressed response, it can deal with it.

from bs4 import BeautifulSoup
import chardet
import re

def get_encoding(soup):
    """
    Find the encoding of a document.

    Takes a BeautifulSoup object and inspects the document's meta tags.
    It checks for a meta charset first; if that exists, it is returned
    as the encoding. If there is no charset, it checks content-type and
    then content, and finally falls back to chardet detection.
    """
    if soup.meta is None:
        # No meta tag at all -- go straight to chardet.
        return chardet.detect(str(soup))['encoding']
    encod = soup.meta.get('charset')
    if encod is None:
        encod = soup.meta.get('content-type')
        if encod is None:
            content = soup.meta.get('content')
            match = re.search('charset=(.*)', content) if content else None
            if match:
                encod = match.group(1)
            else:
                # chardet expects a byte string, not unicode
                dic_of_possible_encodings = chardet.detect(str(soup))
                encod = dic_of_possible_encodings['encoding']
    return encod
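To see why the WARNING mentions REPLACEMENT CHARACTER: it appears when bytes are decoded with the wrong codec. Once get_encoding() gives you the real charset, you can decode explicitly. A minimal illustration with hard-coded bytes (the sample bytes are my own, not from the scraped page):

```python
# -*- coding: utf-8 -*-
# The same bytes decoded with the wrong and the right codec.
# b'caf\xe9' is Latin-1 for "café".
raw = b'caf\xe9'

# Wrong guess (UTF-8): 0xE9 is invalid there, so 'replace' substitutes
# U+FFFD -- exactly the REPLACEMENT CHARACTER from the warning.
bad = raw.decode('utf-8', 'replace')

# Right codec (e.g. what get_encoding() returned): clean text.
good = raw.decode('latin-1')

print(bad)   # caf + U+FFFD
print(good)  # café
```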

Here is a link on dealing with compressed data: http://www.diveintopython.net/http_web_services/gzip_compression.html
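As a rough, stdlib-only sketch of that idea (the function name and the lower-cased header dict are my own assumptions, not from the linked article):

```python
import gzip
import io

def maybe_decompress(data, headers):
    """Decompress an HTTP body if the server sent it gzip-compressed.

    `data` is the raw response bytes; `headers` is a dict of response
    headers, assumed lower-cased here for simplicity.
    """
    if headers.get('content-encoding') == 'gzip':
        return gzip.GzipFile(fileobj=io.BytesIO(data)).read()
    return data

# Round-trip demo with locally compressed bytes:
body = b'<html>hello</html>'
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(body)
compressed = buf.getvalue()

print(maybe_decompress(compressed, {'content-encoding': 'gzip'}) == body)  # True
```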

And, from this question, Check if GZIP file exists in Python (note that any() takes a single iterable, so the original two-argument form was a bug):

import os

if any(os.path.isfile(f) for f in ['bob.asc', 'bob.asc.gz']):
    print 'yay'
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow