Reading and writing non-English characters from websites with Python

https://stackoverflow.com/questions/20954615

24-09-2022

Question

I'm doing a bit of data scraping on Wikipedia, and I want to read certain entries. I'm using urllib.urlopen('http://www.example.com') and calling read() on the result.

This works fine until it encounters non-English characters, as in Stanislav Šesták. Here are the first few lines:

import urllib

print urllib.urlopen("http://en.wikipedia.org/wiki/Stanislav_Šesták").read()

result:

<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" /><title>Stanislav ֵ estֳ¡k - Wikipedia, the free encyclopedia</title>
<meta name="generator" content="MediaWiki 1.23wmf8" />
<link rel="alternate" type="application/x-wiki" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&amp;action=edit" />
<link rel="edit" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&amp;action=edit" />
<link rel="apple-touch-icon" href="//bits.wikimedia.org/apple-touch/wikipedia.png" />

How can I retain the non-English characters? In the end this code will write the entry title and the URL to a .txt file.

Solution

There are multiple issues:

Non-ASCII characters in a string literal: in this case you must put an encoding declaration at the top of the module.
The URL path should be percent-encoded (u"Stanislav_Šesták" -> "Stanislav_%C5%A0est%C3%A1k").
You are printing the bytes received from the web straight to your terminal. Unless both use the same character encoding, you may see garbage in place of some characters.
To interpret HTML, you should probably use an HTML parser.
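The second and third points can be checked without any network access. The sketch below is written for Python 3, not the Python 2 used elsewhere in this answer, so `urllib.quote` becomes `urllib.parse.quote`. The choice of Latin-1 as the "wrong" codec is only an illustration; a terminal may use any other encoding.

```python
# Python 3 sketch (assumption: the rest of this answer uses Python 2).
from urllib.parse import quote, unquote

wiki_title = "Stanislav_Šesták"

# quote() encodes the text as UTF-8 and percent-escapes every non-ASCII byte.
url_path = quote(wiki_title)
print(url_path)  # Stanislav_%C5%A0est%C3%A1k

# The same UTF-8 bytes, decoded with two different codecs.
raw = wiki_title.encode("utf-8")
right = raw.decode("utf-8")    # recovers the original text
wrong = raw.decode("latin-1")  # illustrative wrong codec: one character per byte
print(right == wiki_title, wrong == wiki_title)  # True False
```

The bytes are identical in both cases. Only the codec used to interpret them differs, which is why the page looks fine in a browser but garbled in a terminal that assumes another encoding.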

Here's a code example that takes into account the above remarks:

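#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 code: the declaration above allows non-ASCII string literals.
import cgi
import urllib
import urllib2

wiki_title = u"Stanislav_Šesták"
# percent-encode the UTF-8 bytes of the title for use in the URL path
url_path = urllib.quote(wiki_title.encode('utf-8'))
r = urllib2.urlopen("https://en.wikipedia.org/wiki/" + url_path)
# take the character encoding from the Content-Type response header
_, params = cgi.parse_header(r.headers.get('Content-Type', ''))
encoding = params.get('charset')
content = r.read()  # raw bytes
unicode_text = content.decode(encoding or 'utf-8')  # fall back to UTF-8
print unicode_text  # if this fails with UnicodeEncodeError, set PYTHONIOENCODING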
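For the remaining steps (parsing the HTML and writing the title and URL to a .txt file), here is a rough Python 3 sketch that uses only the standard library. The helper names `decode_body`, `extract_title` and `save_entry`, the tab-separated output format, and the `sample_bytes` stand-in for the downloaded page are all assumptions made for illustration; a dedicated HTML parsing library would work just as well.

```python
# Python 3 sketch; helper names and the sample input are hypothetical.
import os
import tempfile
from email.message import Message
from html.parser import HTMLParser


def decode_body(raw_bytes, content_type):
    """Decode bytes using the charset from a Content-Type header value."""
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    encoding = msg.get_content_charset() or "utf-8"  # fall back to UTF-8
    return raw_bytes.decode(encoding)


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip()


def save_entry(path, title, url):
    # An explicit encoding keeps the file independent of the locale default.
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\t" + url + "\n")


# Stand-in for the bytes returned by read() on the HTTP response.
sample_bytes = (
    '<html><head><meta charset="UTF-8" />'
    "<title>Stanislav Šesták - Wikipedia, the free encyclopedia</title>"
    "</head></html>"
).encode("utf-8")

text = decode_body(sample_bytes, "text/html; charset=UTF-8")
title = extract_title(text)
url = "https://en.wikipedia.org/wiki/Stanislav_%C5%A0est%C3%A1k"
out_path = os.path.join(tempfile.mkdtemp(), "entries.txt")
save_entry(out_path, title, url)
```

Decoding once, working with text throughout, and encoding again only when writing the file keeps the non-English characters intact regardless of the terminal's settings.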

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow