How to deal with Japanese text using Python xlrd

https://stackoverflow.com/questions/6069476

python
xlrd

07-09-2020

Question

This is my code:

#!/usr/bin/python   
#-*-coding:utf-8-*-   

import xlrd,sys,re

data = xlrd.open_workbook('a.xls',encoding_override="utf-8")
a = data.sheets()[0]
s=''
for i in range(a.nrows):
    if 9<i<20:
        #stage
        print a.row_values(i)[1].decode('shift_jis')+'\n'

But it shows:

????
????????
??????
????
????
????
????????

What can I do?

Thanks.

Solution

Background: In a "modern" (Excel 97-2003) XLS file, text is effectively stored as Unicode. In older files, text is stored as 8-bit strings, and a "codepage" record tells how it is encoded e.g. the integer 1252 corresponds to the encoding known as cp1252 or windows-1252. In either case, xlrd presents extracted text as unicode objects.
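To make the codepage idea concrete, here is a minimal sketch. The helper `codec_for_codepage` is hypothetical (it is not xlrd's API) and only mirrors the two mappings mentioned above: 1200 to utf_16_le, and 1252 to cp1252. Codepage 932 (the Windows code page for Shift-JIS) is used as an assumed example for an old Japanese file. The sketch is written to run under Python 3, where `str` plays the role of Python 2's `unicode` and `bytes` the role of Python 2's `str`.

```python
# Hypothetical helper, NOT part of xlrd: turn a codepage number into a
# Python codec name. Only the 1200 special case and the "cp" + number
# pattern are modelled here; real files may need more special cases.
def codec_for_codepage(codepage):
    special = {1200: "utf_16_le"}  # what a "new" (Excel 97-2003) file reports
    return special.get(codepage, "cp%d" % codepage)

text = u"\u65e5\u672c\u8a9e"  # three kanji meaning "Japanese language"

# How the same text would be stored as bytes in each kind of file
# (assumption: the old file declares codepage 932).
old_bytes = text.encode(codec_for_codepage(932))
new_bytes = text.encode(codec_for_codepage(1200))

# Different bytes on disk, but decoding each with its own codec
# gives back the same text.
same_text = (old_bytes.decode("cp932") == new_bytes.decode("utf_16_le") == text)
print(repr(old_bytes), repr(new_bytes), same_text)
```

Either way, the caller ends up with the same text object, which is why the stored encoding normally does not matter to code that uses xlrd.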

Please insert this line into your code:

print data.biff_version, data.codepage, data.encoding

If you have a new file, you should see

80 1200 utf_16_le

In any case, please edit your question to report the outcome.

Problem 1: encoding_override is required ONLY if the file is an old file AND you know/suspect that the codepage record is omitted or wrong. It is ignored if the file is a new file. Do you really know that the file is pre-Excel-97 and the text is encoded in UTF-8? If so, it can only have been created by some seriously deluded 3rd-party software, and Excel will blow up if you try to open the file; visit the author with a baseball bat. Otherwise, don't use encoding_override.

Problem 2: You should have unicode objects. To display them, you need to encode (not decode) them from unicode to str using a suitable encoding. It is very surprising that print unicode_object.decode('shift-jis') doesn't raise an exception and prints question marks.
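As an illustration of the encode direction, here is a small sketch. The sample string is an assumption standing in for a cell value, and the example shows only one common way question marks can arise (a lossy encode to a codec that lacks the characters); it is not a diagnosis of this particular file. It is written to run under Python 3, where `str` corresponds to Python 2's `unicode` and `bytes` to Python 2's `str`.

```python
# Assumed sample value: four katakana characters spelling "stage".
text = u"\u30b9\u30c6\u30fc\u30b8"

# Text -> bytes is ENCODING; this is the direction needed for output.
sjis_bytes = text.encode("shift_jis")

# Bytes -> text is DECODING; it reverses the step above.
round_trip = sjis_bytes.decode("shift_jis")

# Encoding to a codec that cannot represent the characters, with
# errors="replace", substitutes one "?" for each character.
lossy = text.encode("ascii", "replace")

print(repr(sjis_bytes))
print(round_trip == text)
print(repr(lossy))  # four "?" bytes, one per character
```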

To help understand this, please change your code to be like this:

text = a.row_values(i)[1]
print i, repr(text)
print repr(text.decode('shift-jis'))

and report the outcome.

So that we can help you choose an appropriate encoding (if any), tell us what version of what operating system you are using, and what the following lines display:

print sys.stdout.encoding
import locale
print locale.getpreferredencoding()
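Once those two values are known, they can be used to pick an output encoding. The following is a rough sketch only: `encode_for_console` is a hypothetical helper, the fallback order (stdout encoding, then locale, then UTF-8) is an assumption rather than a recommendation from xlrd, and the code is written for Python 3.

```python
import locale
import sys

def encode_for_console(text, fallback="utf-8"):
    """Hypothetical helper: encode text with the console's own encoding.

    Tries sys.stdout.encoding first, then the locale's preferred
    encoding, then the given fallback. Characters the chosen codec
    cannot represent are replaced with "?" instead of raising.
    """
    encoding = (getattr(sys.stdout, "encoding", None)
                or locale.getpreferredencoding()
                or fallback)
    return text.encode(encoding, "replace"), encoding

# Assumed sample value standing in for a cell's text.
data, used = encode_for_console(u"\u30b9\u30c6\u30fc\u30b8")
print(used, repr(data))
```

If the reported encoding cannot represent Japanese characters, the output will contain question marks no matter what the spreadsheet holds.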

Further reading:

(1) the xlrd documentation (the section on Unicode, right at the front), which is included in the distribution and is also available from the project's source repository.

(2) the Python Unicode HOWTO.

OTHER TIPS

Why isn't your encoding override on open shift-jis?

data = xlrd.open_workbook('a.xls',encoding_override="shift-jis")

If the file is really Shift-JIS, most of its byte sequences (well frankly, almost all of them) are not valid UTF-8. If you are getting illegal characters (?) and your file is really UTF-8 and you want to output Shift-JIS, might I suggest that the problem is your output shell: it may not be able to handle the encoding (this applies to print; writing to a file would probably be fine).
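The mismatch between the two encodings is easy to check. This is a minimal sketch with an assumed sample string, written for Python 3; it shows that Shift-JIS bytes for Japanese text are rejected by a UTF-8 decoder.

```python
text = u"\u65e5\u672c\u8a9e"  # three kanji meaning "Japanese language"

# The Shift-JIS byte sequence for the sample text.
sjis_bytes = text.encode("shift_jis")

# Try to read those bytes as if they were UTF-8.
try:
    sjis_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(repr(sjis_bytes), valid_utf8)
```

A strict decode fails outright, so a wrong encoding guess in this direction tends to raise an error rather than silently produce question marks.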

Licensed under: CC-BY-SA with attribution
