Question

When printing out DB2 query results I'm getting the following error on column 'F00002' which is a binary array.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)

I am using the following line:

print result[2].decode('cp037')

...just as I do the first two columns where the same code works fine. Why is this not working on the third column and what is the proper decoding/encoding?

(Screenshots of the query results were attached to the original question.)


Solution

Notice that the error is about encoding to ASCII, not about decoding from cp037. But you're not asking it to encode anywhere, so why is this happening?

Well, there are actually two possible places this could go wrong, and we can't know which of them it is without some help from you.


First, if your result[2] is already a unicode object, calling decode('cp037') on it will first try to encode it with sys.getdefaultencoding(), which is usually 'ascii', so that it has something to decode. So, instead of getting an error saying "Hey, bozo, I'm already decoded", you get an error about encoding to ASCII failing. (This may seem very silly, but it's useful for a handful of codecs that can decode unicode->unicode or unicode->str, like ROT13 and quoted-printable.)

If this is your problem, the solution is to not call decode. You've presumably already decoded the data somewhere along the way to this point, so don't try to do it again. (If you've decoded it wrong, you need to figure out where you decoded it and fix that to do it right; re-decoding it after it's already wrong won't help.)
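If you want to be defensive about it while you track that down, a minimal sketch (Python 2, using the result row from your question) that only decodes when the driver actually handed back raw bytes might look like this:

value = result[2]
if isinstance(value, str):           # raw bytes straight from the driver
    text = value.decode('cp037')     # decode exactly once
else:                                # already a unicode object
    text = value                     # don't decode it again
print repr(text)                     # repr sidesteps console-encoding issues for now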


Second, passing a Unicode string to print will automatically try to encode it with (depending on your Python version) either sys.getdefaultencoding() or sys.stdout.encoding. If Python has failed to guess the right encoding for your console (pretty common on Windows), or if you're redirecting your script's stdout to a file instead of printing to the console (which means Python can't possibly guess the right encoding), you can end up with 'ascii' even in sys.stdout.encoding.

If this is your problem, you have to explicitly specify the right encoding for your console (if you're lucky, it's in sys.stdout.encoding), or the encoding you want for the text file you're redirecting to (probably 'utf-8', but that's up to you), and explicitly encode everything you print.
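For example, a sketch along these lines (Python 2; the 'utf-8' fallback is my assumption, not something detected from your setup) makes the encoding step explicit:

import sys

text = u'\xe3'                                   # stand-in for your already-decoded value
out_encoding = sys.stdout.encoding or 'utf-8'    # encoding is None when stdout is a pipe or file
print text.encode(out_encoding, 'replace')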


So, how do you know which one of these it is?

Simple. print type(result[2]) and see whether it's a unicode or a str. Or break it up into two pieces: x = result[2].decode('cp037') and then print x, and see which of the two raises. Or run in a debugger. You have all kinds of options for debugging this, but you have to do something.

Of course it's also possible that, once you fix the first one, you'll immediately run into the second one. But now you know how to deal with that too.


Also, note that cp037 is EBCDIC, one of the few encodings that Python knows about that isn't ASCII-compatible. In fact, '\xe3' is EBCDIC for the letter T.
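For instance, in an interactive Python 2 session:

>>> '\xe3'.decode('cp037')    # the byte 0xE3 is EBCDIC for 'T'
u'T'
>>> u'T'.encode('cp037')      # and encoding 'T' gives that byte back
'\xe3'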

Other tips

It seems that your result[2] is already unicode:

>>> u'\xe3'.decode('cp037')
Traceback (most recent call last):
   ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 0: ordinal not in range(128)
>>> u'\xe3'.encode('cp037')
'F'

In fact, as @abarnert pointed out in the comments, in Python 2.x a decode called on a unicode object is performed in two steps:

  • encoding to string with sys.getdefaultencoding(),
  • then decoding back to unicode

i.e., your statement is translated as:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'\xe3'.encode('ascii').decode('cp037')

and the error you get comes from the first part of the expression, u'\xe3'.encode('ascii').

All right, so as @abarnert established, you don't really have a Unicode problem, per se. The Unicode only enters the picture when trying to print. After looking at your data, I can see that there is actually not just EBCDIC character data in there, but arbitrary binary data as well. The data definitely seems columnar, so what we probably have here is a bunch of subfields all packed into the field called F00002 in your example. RPG programmers would refer to this as a data structure; it's akin to a C struct.

The F00001 and K00001 columns probably worked fine because they happen to contain only EBCDIC character data.

So if you want to extract the complete data from F00002, you'll have to find out (via documentation or some person who has the knowledge) what the subfields are. Normally, once you've found that out, you could just use Python's struct module to quickly and simply unpack the data, but since the data comes from an IBM i, you may be faced with converting its native data types into Python's types. (The most common of these would be packed decimal for numeric data.)
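To make that concrete, here is a sketch of the struct approach; the layout (a 10-byte EBCDIC name, a 4-byte packed-decimal amount, a 6-byte zoned-decimal date) is invented purely for illustration, and the real F00002 layout has to come from the file's DDS or from whoever designed it:

import struct

record = result[2]                          # the raw bytes of the F00002 data structure

# hypothetical layout: struct can only split the raw bytes (type 's');
# the IBM numeric subfields still need hand conversion afterwards
name_raw, amount_raw, date_raw = struct.unpack('10s4s6s', record)

name = name_raw.decode('cp037').rstrip()    # character subfield: plain EBCDIC text
# amount_raw and date_raw are packed/zoned decimal bytes -- see the notes
# (and the decoding sketch) in the edit below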

For now, you can still extract the character portions of F00002 by decoding as before, but then explicitly choosing a new encoding that works with your output (display or file), as @abarnert suggested. My recommendation is to write the values to a file, using result[2].decode('cp037').encode('utf-8') (which will produce a bunch of clearly not human-readable data interspersed with the text; you may be able to use that as-is, or you could use it to at least tell you where the text portions are for further processing).
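In other words, something like this sketch (the filename is just a placeholder):

with open('f00002_dump.txt', 'wb') as f:
    f.write(result[2].decode('cp037').encode('utf-8'))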


Edit:

We don't have time to do all your work and research for you. Things you need to just read up on and work out for yourself:

  • IBM's packed decimal format (crash course: each digit takes up 4 bits, one hex nibble; with an additional 4 bits on the right for the sign, which is 'F' for positive and 'D' for negative; the whole thing zero-padded on the left if needed to fill out a whole number of bytes; the decimal place is implied; a decoding sketch follows this list)
  • IBM's zoned decimal format (crash course: each digit is 1 byte and is identical to the EBCDIC representation of the corresponding character; except that on the rightmost digit, the upper 4 bits are used for the sign, 'F' for positive and 'D' for negative; decimal place is implied)
  • Python's struct module (doesn't automatically handle the above types; you have to use raw bytes for everything (type 's') and handle as needed)
  • Possibly pick up some ideas (and code) for handling IBM packed and zoned decimals from the add-on api2 module for iSeriesPython 2.7 (in particular, check out the iSeriesStruct class, which is a subclass of struct.Struct, keeping in mind that the whole module is designed to be running on the iSeries, using iSeriesPython, and thus is not necessarily usable as-is from regular Python communicating with the iSeries via pyodbc).
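As a starting point, here is a rough sketch of what decoding those two formats by hand can look like in Python 2; these helper functions are my own illustration of the crash-course descriptions above, not part of iSeriesPython or any IBM library, so verify them against your actual data:

from decimal import Decimal

def unpack_packed_decimal(raw, scale):
    # each nibble is a digit, except the last nibble, which is the sign
    # (0xD means negative; 0xF or 0xC means positive); scale is the number
    # of implied decimal places
    nibbles = []
    for ch in raw:
        nibbles.append(ord(ch) >> 4)
        nibbles.append(ord(ch) & 0x0F)
    sign = nibbles.pop()
    value = Decimal(''.join(str(d) for d in nibbles)) / (10 ** scale)
    return -value if sign == 0x0D else value

def unpack_zoned_decimal(raw, scale):
    # every byte's low nibble is a digit; the last byte's high nibble is the
    # sign (0xD negative, otherwise positive); scale is implied decimal places
    digits = ''.join(str(ord(ch) & 0x0F) for ch in raw)
    sign = ord(raw[-1]) >> 4
    value = Decimal(digits) / (10 ** scale)
    return -value if sign == 0x0D else value

print unpack_packed_decimal('\x12\x34\x5f', 2)   # packed +123.45 -> Decimal('123.45')
print unpack_zoned_decimal('\xf6\xf7\xf8', 0)    # zoned +678 -> Decimal('678')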
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow