An excerpt from a comment under the "Unicode" heading at http://meta.wikimedia.org/wiki/Data_dumps/Dump_format may be helpful:
"The dumps may contain non-Unicode (UTF8) characters in older text revisions due to lenient charset validation in the earlier MediaWiki releases..."
Ignoring for the moment the conflation of Unicode and UTF-8: what you can do to avoid the error is pass the errors keyword argument to open(), e.g.:
filepath = "../data/enwiki-20131202-pagelinks.sql"
with open(filepath, encoding="utf8", errors="replace") as infile:
    ...
That "causes a replacement marker (such as ?) to be inserted where there is malformed data." http://docs.python.org/3/library/functions.html#open
If you'd rather drop the non-UTF-8 bytes entirely, you can use errors='ignore' instead.
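To see the difference between the two handlers, here's a small self-contained sketch. The byte string is made up for illustration (a fragment resembling a dump line, with 0xFF as the deliberately invalid byte); the real pagelinks dump is of course far larger:

```python
import os
import tempfile

# Write a byte sequence that is NOT valid UTF-8 (0xFF never appears in UTF-8).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"INSERT INTO pagelinks \xff VALUES")

# errors='replace' substitutes U+FFFD (often rendered as ?) for the bad byte...
with open(path, encoding="utf8", errors="replace") as infile:
    replaced = infile.read()

# ...while errors='ignore' silently drops it.
with open(path, encoding="utf8", errors="ignore") as infile:
    ignored = infile.read()

os.remove(path)
print(replaced)  # INSERT INTO pagelinks \ufffd VALUES
print(ignored)   # INSERT INTO pagelinks  VALUES
```

With errors='replace' the malformed byte becomes U+FFFD, so you can later detect (and count) corrupted spots; with errors='ignore' it vanishes without a trace, which is simpler but hides where the damage was.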