Question

Update: I have changed the encoding to

with open("../data/enwiki-20131202-pagelinks.sql", encoding="ISO-8859-1")

...and the program is now chewing through the file without complaint. Apparently the SQL dumps aren't valid UTF-8 throughout, and my assumption that they were was false.
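For what it's worth, ISO-8859-1 assigns a character to every one of the 256 possible byte values, so decoding with it can never raise UnicodeDecodeError; the 0xf8 byte from the traceback below, for example, simply decodes to 'ø':

>>> b"\xf8".decode("ISO-8859-1")
'ø'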

Original:

I'm trying to process one of Wikipedia's humongous data sets, namely the pagelinks.sql file.

Unfortunately I get the following error while reading the file:

(...)
File "c:\Program Files\Python 3.3\lib\codecs.py", line 301, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 5095: invalid start byte

My code is as follows:

import re

# count occurrences of ",0," on each line of the dump
reg1 = re.compile(",0,")
ref_count = 0
with open("../data/enwiki-20131202-pagelinks.sql", encoding="utf8") as infile:
    for line in infile:
        matches = reg1.findall(line)
        ref_count += len(matches)

print("found", ref_count, "references.")

Solution

An excerpt from a comment under the "Unicode" heading at http://meta.wikimedia.org/wiki/Data_dumps/Dump_format may be helpful:

"The dumps may contain non-Unicode (UTF8) characters in older text revisions due to lenient charset validation in the earlier MediaWiki releases..."

Ignoring for the moment the conflation of Unicode and UTF-8, what you can do to avoid the error is to pass the errors keyword argument to open(), e.g.:

filepath = "../data/enwiki-20131202-pagelinks.sql" 
with open(filepath, encoding="utf8", errors='replace') as infile:
    ...
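Put together with the loop from the question, a minimal sketch (same file path and pattern as above) would be:

import re

reg1 = re.compile(",0,")
filepath = "../data/enwiki-20131202-pagelinks.sql"
ref_count = 0
# malformed bytes become U+FFFD instead of raising UnicodeDecodeError
with open(filepath, encoding="utf8", errors="replace") as infile:
    for line in infile:
        ref_count += len(reg1.findall(line))

print("found", ref_count, "references.")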

That "causes a replacement marker (such as ?) to be inserted where there is malformed data." http://docs.python.org/3/library/functions.html#open

If you'd rather ignore the non-UTF-8 characters, you can use errors='ignore'.
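For example, 'ignore' simply drops any bytes the codec can't decode:

>>> b"caf\xf8e".decode("utf8", errors="ignore")
'cafe'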

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow