This is what I ended up using:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

def reformatCp1252(match):
    # Numeric character references in the 128-159 range are really
    # windows-1252 code points; replace them with the raw byte so
    # BeautifulSoup decodes them along with the rest of the page.
    codePoint = int(match.group(1))
    if 128 <= codePoint <= 159:
        return bytes([codePoint])
    else:
        return match.group()

localPage = urlopen(r_url).read()
formatedPage = re.sub(b'&#(\d+);', reformatCp1252, localPage, flags=re.I)
localSoup = BeautifulSoup(formatedPage, "lxml", from_encoding="windows-1252")
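To see what the substitution step does on its own, here is a minimal check. The helper is a copy of `reformatCp1252` above, and the sample input is made up for illustration:

```python
import re

def reformatCp1252(match):
    # Same helper as above: turn &#128;..&#159; references
    # into the raw windows-1252 byte they stand for.
    codePoint = int(match.group(1))
    if 128 <= codePoint <= 159:
        return bytes([codePoint])
    return match.group()

sample = b'&#147;Hi&#148; &#167;'  # hypothetical input
fixed = re.sub(rb'&#(\d+);', reformatCp1252, sample)
print(fixed)  # b'\x93Hi\x94 &#167;'
```

Note that `&#167;` (§) is left alone because 167 falls outside the 128-159 range; only the references that would otherwise decode wrongly are rewritten.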
Notes: I am using bs4 with Python 3.3 on Windows 7.
I discovered that the from_encoding argument to BeautifulSoup really doesn't matter here: you can pass utf-8 or windows-1252 and the output is full utf-8 either way, with the windows-1252 bytes converted to utf-8.
Basically, multi-byte sequences are interpreted as utf-8, while stray single bytes like \x93 are interpreted as windows-1252.
As far as I know, only code points 128 to 159 (bytes \x80 to \x9f) of windows-1252 differ from latin-1/Unicode, which is why the function above only rewrites that range.
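You can check that range difference with the standard library alone; the sample bytes below are the same ones used in the example further down:

```python
# Bytes 0x80-0x9F decode to printable characters under windows-1252,
# but they are C1 control codes in latin-1 and invalid on their own
# as utf-8.
for b in (0x93, 0x94, 0x87):
    print(hex(b), repr(bytes([b]).decode('windows-1252')))

try:
    bytes([0x93]).decode('utf-8')
except UnicodeDecodeError:
    print('0x93 on its own is not valid utf-8')
```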
For example, a byte stream with mixed encodings (windows-1252: \x93 and \x94, alongside utf-8: Ÿ) comes out transformed entirely to utf-8.
byteStream = b'\x93Hello\x94 (\xa7232.405 of this chapter) \xc5\xb8 \x87'  # \xc5\xb8 is utf-8 for Ÿ
# run it through the same code as above
localSoup = BeautifulSoup(byteStream, "lxml", from_encoding="windows-1252")
print(localSoup.encode('utf-8'))
# and you can see that \x93 was transformed to its utf-8 equivalent (the curly quote ")