Frage

I am trying to scrape text from the web using BeautifulSoup 4 to parse it out. I am running into an issue when printing bs4 processed text out to the console. Whenever I hit a character that was originally an HTML entity, like ’ I get garbage characters on the console. I believe bs4 is converting these entities to unicode correctly because if I try using another encoding to print out the text, it will complain about the appropriate lack of unicode mapping for a character (like u'\u2019.) I'm not sure why the print function gets confused over these characters. I've tried changing around fonts, which changes the garbage characters, and am on a Windows 7 machine with US-English locale. Here is my code for reference, any help is appreciated. Thanks in advance!

#!/usr/bin/python
import json
import urllib2
import cookielib
import bs4

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Tiguan\
&page=0&api-key=blah"
response = opener.open(url)
articles = response.read()
decoded = json.loads(articles)

totalpages = decoded['response']['meta']['hits']/10

for page in range(totalpages + 1):
    if page>0:
        url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?\
q=Tiguan&page=" + str(page) + "&api-key=blah"
        response = opener.open(url)
        articles = response.read()
        decoded = json.loads(articles)
    for url in decoded['response']['docs']:
        print url['web_url']
        urlstring = url['web_url']
        art = opener.open(urlstring)
        soup = bs4.BeautifulSoup(art.read())
        goodstuff = soup.findAll('nyt_text')
        for tag in goodstuff:
            print tag.prettify().encode("UTF")
War es hilfreich?

Lösung

The problem has nothing to do with bs4, or HTML entities, or anything else. You could reproduce the exact same behavior, on most Windows systems, with a one-liner program to print out the same characters that are appearing as garbage when you try to print them, like this:

print u'\u2019'.encode('UTF-8')

The problem here is that, like the vast majority of Windows systems (and nothing else anyone uses in 2013), your default character set is not UTF-8, but something like CP1252.

So, when you encode your Unicode strings to UTF-8 and print those bytes to the console, the console interprets them as CP1252. Which, in this case, means you get ’ instead of .

Changing fonts won't help. The UTF-8 encoding of \u2013 is the three bytes \xe2, \x80, and \x99, and the CP1252 meaning of those three bytes is â, , and .

If you want to encode manually for the console, you need to encode to the right character set, the one your console actually uses. You may be able to get that as sys.stdout.encoding.

Of course you may get an exception trying to encode things for the right character set, because 8-bit character sets like CP1252 can only handle about 240 of the 110K characters in Unicode. The only way to handle that is to use the errors argument to encode to either ignore them or replace them with replacement characters.

Meanwhile, if you haven't read the Unicode HOWTO, you really need to. Especially if you plan to stick with Python 2.x and Windows.


If you're wondering why a few command-line programs seem to be able to get around these problems: Microsoft's solution to the character set problem is to create a whole parallel set of APIs that use 16-bit characters instead of 8-bit, and those APIs always use UTF-16. Unfortunately, many things, like the portable stdio wrappers that Microsoft provides for talking to the console and that Python 2.x relies on, only have the 8-bit API. Which means the problem isn't solved at all. Python 3.x no longer uses those wrappers, and there have been recurring discussions on making some future version talk UTF-16 to the console. But even if that happens in 3.4 (which seems very unlikely), that won't help you as long as you're using 2.x.

Andere Tipps

@abarnert's answer contains a good explanation of the issue.

In your particular case, you could just pass encoding parameter to prettify() instead of default utf-8.

If you are printing to console, you could try to print Unicode directly:

print soup.prettify(encoding=None, formatter='html') # print Unicode

It may fail. If you pass ascii; then BeautifulSoup may use numerical character references instead of non-ascii characters:

print soup.prettify('ascii', formatter='html')

It assumes that current Windows codepage is ascii-based encoding (most of them do). It should also work if the output is redirected to a file or another program via a pipe.

For portability, you could always print Unicode (encoding=None above) and use PYTHONIOENCODING to get appropriate character encoding e.g., utf-8 for files, pipes and ascii:xmlcharrefreplace to avoid garbage in a console.

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top