This is what I ended up using:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

def reformatCp1252(match):
    # Numeric character references in the 128-159 range are really
    # windows-1252 code points; replace them with the raw byte so
    # BeautifulSoup decodes them along with the rest of the page.
    codePoint = int(match.group(1))
    if 128 <= codePoint <= 159:
        return bytes([codePoint])
    else:
        return match.group()

localPage = urlopen(r_url).read()
formatedPage = re.sub(b'&#(\d+);', reformatCp1252, localPage, flags=re.I)
localSoup = BeautifulSoup(formatedPage, "lxml", from_encoding="windows-1252")
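To see what the substitution step does on its own, here is a minimal check. The helper is a copy of `reformatCp1252` above, and the sample input is made up for illustration:

```python
import re

def reformatCp1252(match):
    # Same helper as above: turn &#128;..&#159; references
    # into the raw windows-1252 byte they stand for.
    codePoint = int(match.group(1))
    if 128 <= codePoint <= 159:
        return bytes([codePoint])
    return match.group()

sample = b'&#147;Hi&#148; &#167;'  # hypothetical input
fixed = re.sub(rb'&#(\d+);', reformatCp1252, sample)
print(fixed)  # b'\x93Hi\x94 &#167;'
```

Note that `&#167;` (§) is left alone because 167 falls outside the 128-159 range; only the references that would otherwise decode wrongly are rewritten.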
Notes: I am using bs4 with Python 3.3 on Windows 7.
I discovered that the from_encoding argument to BeautifulSoup really doesn't matter here: you can pass utf-8 or windows-1252 and the output is full utf-8 either way, with the windows-1252 bytes converted to utf-8.
Basically, multi-byte sequences are interpreted as utf-8, while stray single bytes like \x93 are interpreted as windows-1252.
As far as I know, only code points 128 to 159 (bytes \x80 to \x9f) of windows-1252 differ from latin-1/Unicode, which is why the function above only rewrites that range.
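You can check that range difference with the standard library alone; the sample bytes below are the same ones used in the example further down:

```python
# Bytes 0x80-0x9F decode to printable characters under windows-1252,
# but they are C1 control codes in latin-1 and invalid on their own
# as utf-8.
for b in (0x93, 0x94, 0x87):
    print(hex(b), repr(bytes([b]).decode('windows-1252')))

try:
    bytes([0x93]).decode('utf-8')
except UnicodeDecodeError:
    print('0x93 on its own is not valid utf-8')
```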
For example, a byte stream with mixed encodings (windows-1252: \x93 and \x94, alongside utf-8: Ÿ) comes out transformed entirely to utf-8.
byteStream = b'\x93Hello\x94 (\xa7232.405 of this chapter) \xc5\xb8 \x87'  # \xc5\xb8 is utf-8 for Ÿ
# run it through the same code as above
localSoup = BeautifulSoup(byteStream, "lxml", from_encoding="windows-1252")
print(localSoup.encode('utf-8'))
# and you can see that \x93 was transformed to its utf-8 equivalent (the curly quote ")