Question

I have a string containing what I guess you'd call a "special" character (an o with an umlaut, ö), and it's throwing off the DBF library I am using: Ethan Furman's Python dbf library (https://pypi.python.org/pypi/dbf). The error is raised on the last line of its retrieve_character() function: 'ascii' codec can't decode byte 0xf6 in position 6: ordinal not in range(128).

The code:

def retrieve_character(bytes, fielddef, memo, decoder):
    """
    Returns the string in bytes as fielddef[CLASS] or fielddef[EMPTY]
    """
    data = bytes.tostring()
    if not data.strip():
        cls = fielddef[EMPTY]
        if cls is NoneType:
            return None
        return cls(data)
    if fielddef[FLAGS] & BINARY:
        return data
    return fielddef[CLASS](decoder(data)[0]) #error on this line

Solution

dbf files have a codepage attribute. It sounds like it has not been set correctly in your file. Do you know which code page was used to create the data? If so, you can override the dbf's setting when you open the file:

table = dbf.Table('dbf_file', codepage='cp437')

cp437 is just an example -- use whatever is appropriate.

To see the current codepage of a dbf file (assuming you didn't override on opening) use:

table.codepage
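
For context, the codepage is recorded in the file itself: in the common dBASE/FoxPro header layout, the byte at offset 29 is a "language driver" ID. The sketch below (Python 3, standard library only) peeks at that byte without going through the dbf library. The function name and the three-entry mapping are illustrative assumptions, and the mapping is deliberately partial, so an unmapped or zero value should be read as "unknown or not set" rather than as an error.

```python
# Hypothetical helper: peek at the language-driver byte of a .dbf header.
# Only a few well-known IDs are mapped; the table is deliberately partial.
KNOWN_DRIVER_IDS = {
    0x01: 'cp437',   # U.S. MS-DOS
    0x02: 'cp850',   # International MS-DOS
    0x03: 'cp1252',  # Windows ANSI
}

def read_codepage_byte(path):
    """Return (raw_id, codec_name_or_None) read from byte 29 of the header."""
    with open(path, 'rb') as fh:
        header = fh.read(32)
    if len(header) < 32:
        raise ValueError('file too short to be a dbf table')
    raw_id = header[29]  # an int in Python 3
    return raw_id, KNOWN_DRIVER_IDS.get(raw_id)
```

If the second value comes back as None, the file either uses an ID not listed here or never had one set, which would be consistent with the library falling back to ASCII.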

If you specify the wrong codepage when you open the file, then the non-ascii data could be incorrect (e.g. your o with umlaut may end up as an n with tilde).
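
To make that concrete, here is a small sketch (Python 3, standard library only) of how the byte from the error message, 0xf6, reads under a few code pages. The sample bytes are made up for illustration; they are not taken from the asker's file.

```python
raw = b'K\xf6nig'  # made-up sample; 0xf6 is the byte from the error message

# ASCII only covers 0x00-0x7f, so this reproduces the kind of failure in the question.
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print('ascii failed:', exc)

# The same bytes decode without error under several code pages,
# but only the one used to write the data yields the intended text.
decoded = {cp: raw.decode(cp) for cp in ('cp1252', 'cp437', 'cp850')}
for codepage, text in decoded.items():
    print(codepage, '->', text)
```

Under cp1252 the byte comes out as ö, while under cp437 it is a division sign, so a wrong codepage fails silently rather than with an exception.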

Other suggestions

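Have you tried using unicodeData.encode('ascii', 'ignore')? Be aware that this drops any character that cannot be represented in ASCII rather than converting it (the ö is removed, not turned into an o), and that it only works on data that has already been decoded to unicode.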
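
A quick sketch of what 'ignore' actually does, next to one common way to keep the base letter instead (Python 3, standard library only; the sample text is made up). Note that both operate on text that is already decoded, so neither addresses a decode error on its own.

```python
import unicodedata

text = 'K\u00f6nig'  # 'König', made-up sample

# errors='ignore' silently drops anything that is not ASCII.
dropped = text.encode('ascii', 'ignore')

# Decomposing first (NFKD) splits ö into 'o' plus a combining diaeresis;
# the combining mark is then dropped and the base letter survives.
stripped = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')

print(dropped, stripped)
```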

Here is my approach. The code pages a dbf file may use are listed at http://dbf-software.com/dbf-file-encoding.html (reproduced below); you can use re.findall to pull all the code page numbers out of that list.

Windows Encodings:
874 Thai Windows
932 Japanese Windows
936 Chinese (PRC, Singapore) Windows
949 Korean Windows
950 Chinese (Hong Kong SAR, Taiwan) Windows
1250 Eastern European Windows
1251 Russian Windows
1252 Windows ANSI
1253 Greek Windows
1254 Turkish Windows
1255 Hebrew Windows
1256 Arabic Windows
MS-DOS Encodings:
437 U.S. MS-DOS
620 Mazovia (Polish) MS-DOS
737 Greek MS-DOS (437G)
850 International MS-DOS
852 Eastern European MS-DOS
857 Turkish MS-DOS
861 Icelandic MS-DOS
865 Nordic MS-DOS
866 Russian MS-DOS
895 Kamenicky (Czech) MS-DOS
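
The re.findall step could look roughly like this (Python 3). The `listing` string holds only a few lines copied from the list above to keep the sketch short, and the variable names are my own.

```python
import re

# A few lines copied from the list above; paste the full list in practice.
listing = """\
Windows Encodings:
1252 Windows ANSI
MS-DOS Encodings:
437 U.S. MS-DOS
850 International MS-DOS
"""

# Capture the leading number on each line; heading lines have none and are skipped.
codepage_list = re.findall(r'^(\d+)\s', listing, flags=re.MULTILINE)
print(codepage_list)
```

The numbers are kept as strings so they can be formatted straight into codec names such as 'cp437'.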

Pseudo-code:

import dbf

codepage_list = ['936', '437']  # ... extend with the other code pages listed above

for codepage in codepage_list:
    table = dbf.Table('mydbf.dbf', codepage='cp{}'.format(codepage))
    table.open(dbf.READ_WRITE)
    try:
        for row in table:
            print(row)
    except UnicodeDecodeError:
        # decoding failed, so this code page cannot be the right one
        print('wrong codepage', codepage)
    finally:
        # close the table whether or not decoding succeeded
        table.close()
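
The same elimination idea can be tried on a raw byte string without opening a table at all. This is only a sketch (Python 3, standard library only): `decodable_codepages` is a hypothetical helper name, and passing the test is a weak signal, because many single-byte code pages accept every possible byte.

```python
def decodable_codepages(raw, codepages):
    """Return the codecs from `codepages` that decode `raw` without error."""
    survivors = []
    for codepage in codepages:
        try:
            raw.decode(codepage)
        except (UnicodeDecodeError, LookupError):
            # LookupError means Python has no codec by that name.
            continue
        survivors.append(codepage)
    return survivors


# Example call with made-up bytes and a hand-picked candidate list.
print(decodable_codepages(b'K\xf6nig', ['ascii', 'cp1252', 'cp437']))
```

A code page that survives still has to be checked by eye: print the decoded text and confirm it shows the characters you expect.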
Licensed under: CC-BY-SA with attribution