I had encoding issues on a scraping project and wrote a function to find the encoding of the page being scraped; you can then decode the page to unicode before passing it to your function, which helps prevent these errors. As for compression, you need to write your code so that it can handle a compressed file whenever it encounters one.
from bs4 import BeautifulSoup
import chardet
import re

def get_encoding(soup):
    """
    Find the encoding of a document.

    Takes a BeautifulSoup object and reads the values of the document's
    meta tags. It checks for a meta charset first; if that exists, it is
    returned as the encoding. If there is no charset, it checks for
    content-type and then content to try to find it, and finally falls
    back to chardet's guess.
    """
    encod = soup.meta.get('charset')
    if encod is None:
        encod = soup.meta.get('content-type')
        if encod is None:
            content = soup.meta.get('content')
            match = re.search('charset=(.*)', content)
            if match:
                encod = match.group(1)
            else:
                # chardet expects bytes, so re-encode the parsed markup.
                dic_of_possible_encodings = chardet.detect(str(soup).encode())
                encod = dic_of_possible_encodings['encoding']
    return encod
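To show what the regex branch is matching against, here is a small standalone example; the content value is a made-up sample of what a meta tag's content attribute typically looks like:

```python
import re

# A made-up value of a meta tag's content attribute.
content = 'text/html; charset=ISO-8859-1'

# Same pattern the function uses to pull the charset out.
match = re.search('charset=(.*)', content)
encoding = match.group(1)  # 'ISO-8859-1'
```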
Here is a link on dealing with compressed data: http://www.diveintopython.net/http_web_services/gzip_compression.html
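A minimal sketch of the idea from that link: sniff whether the body is gzipped and decompress it before decoding. The payload here is constructed locally for illustration rather than fetched from a server:

```python
import gzip

# Simulate a compressed response body (made-up payload).
original = b'<html><body>hello</body></html>'
compressed = gzip.compress(original)

# Gzip data starts with the magic bytes 0x1f 0x8b; check before decompressing.
if compressed[:2] == b'\x1f\x8b':
    body = gzip.decompress(compressed)
else:
    body = compressed
```

In real scraping code you would check the response's Content-Encoding header as well, since servers that honor an Accept-encoding: gzip request header will tell you the body is compressed.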
And from this question, Check if GZIP file exists in Python:

import os

if any(os.path.isfile(f) for f in ['bob.asc', 'bob.asc.gz']):
    print('yay')