Question

I'm doing a bit of data scraping on Wikipedia, and I want to read certain entries. I'm using urllib.urlopen('http://www.example.com') and calling .read() on the result.

This works fine until it encounters non-English characters, as in Stanislav Šesták. Here are the first few lines:

import urllib

print urllib.urlopen("http://en.wikipedia.org/wiki/Stanislav_Šesták").read()

result:

<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" /><title>Stanislav ֵ estֳ¡k - Wikipedia, the free encyclopedia</title>
<meta name="generator" content="MediaWiki 1.23wmf8" />
<link rel="alternate" type="application/x-wiki" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&amp;action=edit" />
<link rel="edit" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&amp;action=edit" />
<link rel="apple-touch-icon" href="//bits.wikimedia.org/apple-touch/wikipedia.png" />

How can I retain the non-English characters? In the end this code will write the entry title and the URL in a .txt file.


Solution

There are multiple issues:

  • Non-ASCII characters in a string literal: in this case you must put an encoding declaration at the top of the module.
  • You should percent-encode the URL path (u"Stanislav_Šesták" -> "Stanislav_%C5%A0est%C3%A1k").
  • You are printing bytes received from the web to your terminal. Unless both use the same character encoding, you may see garbage in place of some characters.
  • To interpret HTML, you should probably use an HTML parser.
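The second and third points can be checked without touching the network. The following is a minimal sketch, written so that it should run on both Python 2 and Python 3. The choice of latin-1 as the "wrong" codec is only an assumption for illustration, since the question does not say which encoding the terminal uses.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2, as used in the question

title = "Stanislav_Šesták"

# Percent-encode the UTF-8 bytes of the title for use in a URL path.
url_path = quote(title.encode("utf-8"))
print(url_path)  # Stanislav_%C5%A0est%C3%A1k

# The same bytes, read back under two different encodings.
raw = "Šesták".encode("utf-8")
as_utf8 = raw.decode("utf-8")      # matches the original text
as_latin1 = raw.decode("latin-1")  # wrong codec (assumed): one character per byte
print(as_utf8 == "Šesták", as_latin1 == "Šesták")  # True False
```

The percent-encoded path matches the one that appears in the page's own edit links, and the mismatch shows how correctly transmitted UTF-8 bytes can still display as garbage.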

Here's a code example that takes into account the above remarks:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import cgi
import urllib
import urllib2

wiki_title = u"Stanislav_Šesták"
# percent-encode the UTF-8 bytes of the title for use in the URL path
url_path = urllib.quote(wiki_title.encode('utf-8'))
r = urllib2.urlopen("https://en.wikipedia.org/wiki/" + url_path)
# read the charset from the Content-Type response header, if present
_, params = cgi.parse_header(r.headers.get('Content-Type', ''))
encoding = params.get('charset')
content = r.read()  # raw bytes
unicode_text = content.decode(encoding or 'utf-8')  # fall back to UTF-8
print unicode_text  # if this fails, set PYTHONIOENCODING
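Since the end goal in the question is a .txt file with the entry title and URL, writing with an explicit encoding avoids depending on the terminal at all. Here is a rough sketch that should run on Python 2 and 3. The helper names charset_from_content_type and save_entry, the tab-separated line format, and the sample bytes and header are all assumptions made for illustration; the header parsing uses email.message.Message as a stand-in for cgi.parse_header, which newer Python 3 releases no longer ship.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io
import os
import tempfile
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def save_entry(path, title, url):
    """Append one 'title<TAB>url' line to `path`, encoded as UTF-8."""
    with io.open(path, "a", encoding="utf-8") as f:
        f.write("{0}\t{1}\n".format(title, url))


# Stand-ins for r.read() and the response header; no network access here.
content = "<title>Stanislav Šesták - Wikipedia</title>".encode("utf-8")
encoding = charset_from_content_type("text/html; charset=UTF-8")
unicode_text = content.decode(encoding)

# Write to a temporary directory so the sketch leaves no file behind in cwd.
out_path = os.path.join(tempfile.mkdtemp(), "entries.txt")
save_entry(out_path, "Stanislav Šesták",
           "https://en.wikipedia.org/wiki/Stanislav_%C5%A0est%C3%A1k")
```

Reading the file back with io.open(out_path, encoding="utf-8") should return the title with its non-English characters intact.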

Licensed under: CC-BY-SA with attribution