BeautifulSoup gives me unicode+html symbols, rather than straight up unicode. Is this a bug or misunderstanding?
08-07-2019
Question
I'm using BeautifulSoup to scrape a website. The website's page renders fine in my browser:
Oxfam International’s report entitled “Offside!
http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271
In particular, the single and double quotes look fine. They look like HTML symbols rather than ASCII, though strangely when I view source in FF3 they appear to be normal ASCII.
Unfortunately, when I scrape I get something like this
u'Oxfam International\xe2€™s report entitled \xe2€œOffside!
The page's meta data indicates 'iso-8859-1' encoding. I've tried different encodings, played with unicode->ascii and html->ascii third-party functions, and looked at the MS/iso-8859-1 discrepancy. The fact of the matter is that ™ has nothing to do with a single quote, and with my limited knowledge I can't seem to turn the unicode+htmlsymbol combo into the right ASCII or HTML symbol, which is why I'm seeking help.
I'd be happy with an ascii double quote, &quot; or "
The problem with special-casing the following sequence is that I'm concerned there are other funny symbols decoded incorrectly:
\xe2€™
Below is some Python to reproduce what I'm seeing, followed by the things I've tried.
import twill
from twill import get_browser
from twill.commands import go
from BeautifulSoup import BeautifulSoup as BSoup
url = 'http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271'
twill.commands.go(url)
soup = BSoup(twill.commands.get_browser().get_html())
ps = soup.body("p")
p = ps[52]
>>> p
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 22: ordinal not in range(128)
>>> p.string
u'Oxfam International\xe2€™s report entitled \xe2€œOffside!<elided>\r\n'
http://www.fourmilab.ch/webtools/demoroniser/
http://www.crummy.com/software/BeautifulSoup/documentation.html
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
>>> AsciiDammit.asciiDammit(p.decode())
u'<p>Oxfam International\xe2€™s report entitled \xe2€œOffside!
>>> handle_html_entities(p.decode())
u'<p>Oxfam International\xe2\u20ac\u2122s report entitled \xe2\u20ac\u0153Offside!
>>> unicodedata.normalize('NFKC', p.decode()).encode('ascii','ignore')
'<p>Oxfam International€™s report entitled €œOffside!
>>> htmlStripEscapes(p.string)
u'Oxfam International\xe2TMs report entitled \xe2Offside!
EDIT:
I've tried using a different BS parser:
import html5lib
bsoup_parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("beautifulsoup"))
soup = bsoup_parser.parse(twill.commands.get_browser().get_html())
ps = soup.body("p")
ps[55].decode()
which gives me this
u'<p>Oxfam International\xe2\u20ac\u2122s report entitled \xe2\u20ac\u0153Offside!
The best-case decode seems to give me the same results:
unicodedata.normalize('NFKC', p.decode()).encode('ascii','ignore')
'<p>Oxfam InternationalTMs report entitled Offside!
EDIT 2:
I am running Mac OS X 4 with FF 3.0.7 and Firebug
Python 2.5 (wow, can't believe I didn't state this from the beginning)
Solution
That's one seriously messed up page, encoding-wise :-)
There's nothing really wrong with your approach at all. I would probably tend to do the conversion before passing it to BeautifulSoup, just because I'm persnickety:
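import urllib
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as in the question
html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
# Decode the raw bytes yourself, using the charset the page's meta tag declares
h = html.decode('iso-8859-1')
soup = BeautifulSoup(h)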
In this case, the page's meta tag is lying about the encoding. The page is actually in UTF-8. Firefox's page info reveals the real encoding, and you can see this charset in the response headers returned by the server:
curl -i http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271
HTTP/1.1 200 OK
Connection: close
Date: Tue, 10 Mar 2009 13:14:29 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Set-Cookie: COMPANYID=271;path=/
Content-Language: en-US
Content-Type: text/html; charset=UTF-8
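If you'd rather read that charset programmatically than eyeball the curl output, here is a minimal sketch. It is written for Python 3 (not the Python 2.5 used in the question), and the helper name charset_from_content_type is made up for illustration. It only parses a Content-Type value you already have; it doesn't fetch anything.

```python
from email.message import Message


def charset_from_content_type(value, default=None):
    """Return the lower-cased charset parameter of a Content-Type value.

    Hypothetical helper: returns `default` when no charset is present.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(default)


# The header value shown in the curl output above.
header_charset = charset_from_content_type("text/html; charset=UTF-8")
print(header_charset)
```

When the header and the meta tag disagree, as they do here, the header is usually the better bet, but treat that as a rule of thumb rather than a guarantee.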
If you do the decode using 'utf-8', it will work for you (or, at least, it did for me):
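import urllib
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as in the question
html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
# Decode with the charset from the response headers, not the one in the meta tag
h = html.decode('utf-8')
soup = BeautifulSoup(h)
ps = soup.body("p")
p = ps[52]
print p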
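For pages where you can't trust the declared charset, one possible approach is to try a strict UTF-8 decode first and only fall back to the declared encoding if that fails. The sketch below assumes Python 3, and decode_html is a hypothetical name, not part of BeautifulSoup or urllib. It is a heuristic: text in another encoding that contains non-ASCII bytes rarely happens to be valid UTF-8, but it can.

```python
def decode_html(raw, declared="iso-8859-1"):
    """Decode raw page bytes, preferring UTF-8 over the declared charset.

    Hypothetical helper: `declared` is whatever the page's meta tag claims.
    """
    try:
        return raw.decode("utf-8")  # strict: raises on invalid sequences
    except UnicodeDecodeError:
        return raw.decode(declared, "replace")


# UTF-8 bytes for a short phrase containing a right single quote (U+2019).
raw = "Oxfam International\u2019s report".encode("utf-8")
print(decode_html(raw))            # decoded correctly as UTF-8
print(repr(raw.decode("cp1252")))  # the same bytes read as cp1252: mojibake
```

The last line reproduces the kind of garbage seen in the question: the three UTF-8 bytes of the quote each get read as a separate cp1252 character.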
OTHER TIPS
It's actually UTF-8 misencoded as CP1252:
>>> print u'Oxfam International\xe2€™s report entitled \xe2€œOffside!'.encode('cp1252').decode('utf8')
Oxfam International’s report entitled “Offside!
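The same round trip can be wrapped in a small function for strings that have already been mis-decoded. This is a sketch for Python 3, and fix_mojibake is a hypothetical name. It assumes the damage really is UTF-8 read as cp1252, and returns the input unchanged if the round trip fails.

```python
def fix_mojibake(text):
    """Undo UTF-8-read-as-cp1252 damage; return text unchanged on failure.

    Hypothetical helper, assuming the only damage is a cp1252 mis-decode.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The garbled string from the question, written with explicit escapes.
broken = ("Oxfam International\xe2\u20ac\u2122s report entitled "
          "\xe2\u20ac\u0153Offside!")
print(fix_mojibake(broken))
```

One caveat: cp1252 leaves a few byte values undefined (0x9D among them), so characters whose UTF-8 form contains those bytes, such as a closing double quote, may not survive the trip and would come back unchanged here. Decoding the original bytes as UTF-8 in the first place avoids the problem.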