base64 encoding unicode strings in python 2.7

https://stackoverflow.com//questions/9572274

07-12-2019
Question

I have a unicode string retrieved from a webservice using the requests module, which contains the bytes of a binary document (PCL, as it happens). One of these bytes has the value 248, and attempting to base64 encode it leads to the following error:

In [68]: base64.b64encode(response_dict['content']+'\n')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
C:\...\<ipython-input-68-8c1f1913eb52> in <module>()
----> 1 base64.b64encode(response_dict['content']+'\n')

C:\Python27\Lib\base64.pyc in b64encode(s, altchars)
     51     """
     52     # Strip off the trailing newline
---> 53     encoded = binascii.b2a_base64(s)[:-1]
     54     if altchars is not None:
     55         return _translate(encoded, {'+': altchars[0], '/': altchars[1]})

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 272: ordinal not in range(128)

In [69]: response_dict['content'].encode('base64')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
C:\...\<ipython-input-69-7fd349f35f04> in <module>()
----> 1 response_dict['content'].encode('base64')

C:\...\base64_codec.pyc in base64_encode(input, errors)
     22     """
     23     assert errors == 'strict'
---> 24     output = base64.encodestring(input)
     25     return (output, len(input))
     26

C:\Python27\Lib\base64.pyc in encodestring(s)
    313     for i in range(0, len(s), MAXBINSIZE):
    314         chunk = s[i : i + MAXBINSIZE]
--> 315         pieces.append(binascii.b2a_base64(chunk))
    316     return "".join(pieces)
    317

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 44: ordinal not in range(128)

I find this slightly surprising, because 248 is within the range of an unsigned byte (and can be held in a byte string), but my real question is: what is the best or right way to encode this string?

My current work-around is this:

In [74]: byte_string = ''.join(map(compose(chr, ord), response_dict['content']))

In [75]: byte_string[272]
Out[75]: '\xf8'

This appears to work correctly, and the resulting byte_string is capable of being base64 encoded, but it seems like there should be a better way. Is there?

Solution

Since you are working with binary data, I'm not sure that it's a good idea to use the UTF-8 encoding: it turns every character above 127 into two or more bytes, so the encoded result no longer matches the original document byte for byte. Whether that matters depends on how you intend to use the base64 encoded representation. I think it would be better to retrieve the data as a byte string rather than a unicode string. I have never used the requests library, but browsing the documentation suggests that it is possible. There are sections talking about "Binary Response Content" and "Raw Response Content".
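To make the concern about UTF-8 concrete, here is a minimal sketch comparing two encodings of the same unicode string. It is not from the original answers: the sample value `content` is a hypothetical stand-in for `response_dict['content']`, and the use of Latin-1 is a swapped-in technique. It assumes every character in the string is below U+0100 and stands for exactly one byte of the document, which is what the chr/ord workaround in the question also assumes.

```python
import base64

# Hypothetical stand-in for response_dict['content']: a unicode string
# whose code points are really the byte values of a binary document.
content = u'\x1bE\xf8'

# Latin-1 maps code points 0-255 to the identical byte values, so this
# does in one call what the chr/ord workaround does per character.
raw = content.encode('latin-1')

# UTF-8 turns u'\xf8' into two bytes, so the payload is no longer the
# original document.
as_utf8 = content.encode('utf-8')

latin1_b64 = base64.b64encode(raw)
utf8_b64 = base64.b64encode(as_utf8)

print(len(raw), len(as_utf8))  # 3 bytes versus 4 bytes
```

If the string contains any character above U+00FF, `encode('latin-1')` raises UnicodeEncodeError, which is a useful signal that the string does not hold raw byte values after all.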

Other tips

You have a unicode string which you want to base64 encode. The problem is that b64encode() only works on bytes, not characters. So, you need to transform your unicode string (which is a sequence of abstract Unicode codepoints) into a byte string.

The mapping of abstract Unicode strings into a concrete series of bytes is called encoding. Python supports several encodings; I suggest the widely-used UTF-8 encoding:

byte_string = response_dict['content'].encode('utf-8')

Note that whoever is decoding the bytes will also need to know which encoding was used to get back a unicode string via the complementary decode() function:

# Decode
decoded = byte_string.decode('utf-8')
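Putting the two steps together, a full round trip might look like the sketch below. The sample string `text` is made up for illustration, and the approach only suits genuine text; it assumes both sides agree on UTF-8.

```python
import base64

# Hypothetical sample text containing non-ASCII characters.
text = u'bl\xe5b\xe6rgr\xf8d'

# Sender: unicode -> UTF-8 bytes -> base64.
encoded = base64.b64encode(text.encode('utf-8'))

# Receiver: base64 -> UTF-8 bytes -> unicode.
restored = base64.b64decode(encoded).decode('utf-8')

print(restored == text)
```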

A good starting point for learning more about Unicode and encodings is the Python docs, and this article by Joel Spolsky.

I would suggest first encoding it to something like UTF-8 before base64 encoding:

In [12]: my_unicode = u'\xf8'

In [13]: my_utf8 = my_unicode.encode('utf-8')

In [15]: base64.b64encode(my_utf8)
Out[15]: 'w7g='

It should be possible to get the response as binary bytes and skip the decoding and encoding steps entirely. There's always a possibility that requests will choose an encoding that loses some data or errors out in the round trip.

This part of the docs called "Binary Response Content" seems to fit your problem perfectly.
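As a rough sketch of that suggestion: in requests, `response.content` holds the body as raw bytes, while `response.text` is the decoded unicode version. The helper name `b64_of_body` and the `FakeResponse` class below are hypothetical, used only so the example runs without a network call; in real code the object would come from something like `requests.get(url)`.

```python
import base64

def b64_of_body(response):
    # Use the raw bytes of the body; no decode/encode step is involved.
    return base64.b64encode(response.content)

class FakeResponse(object):
    # Hypothetical stand-in for a requests response object.
    content = b'\x1bE\xf8'

encoded_body = b64_of_body(FakeResponse())
print(base64.b64decode(encoded_body) == FakeResponse.content)
```

This only helps if the binary document is the response body itself. If it arrives embedded in a decoded structure, as `response_dict['content']` suggests, the bytes still have to be recovered from the unicode string.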

If it's binary data, why encode or decode it at all, especially the "base64.encodestring" part? Below is how I encode images into base64 so I can add them directly to my Python code instead of shipping extra files. This is on Python 2.7.2, by the way.

import base64

# Open in binary mode so the bytes are read unchanged, and let the
# with-block close the file afterwards.
with open("blah.icon", "rb") as iconfile:
    icondata = base64.b64encode(iconfile.read())

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow